Data Quality - What, Why, How, 10 Best Practices & More
As data is becoming a core part of every business operation, the quality of the data that is gathered, stored and consumed during business processes will determine the success achieved in doing business today and tomorrow.
What is data quality?
Regard data as the foundation for a hierarchy where data is the bottom level. On top of data, you have information, being data in context. Further up, we have knowledge seen as actionable information and on the top level, wisdom as the applied knowledge.
If data quality is bad, information quality suffers. With bad information quality, you will lack actionable knowledge in business operations and will either be unable to apply that knowledge or apply it incorrectly, resulting in risky business outcomes.
There are several definitions of data quality. The two predominant ones are:
- Data is of high quality if the data is fit for the intended use or purpose.
- Data is of high quality if the data correctly represents the real-world construct it describes.
These two definitions may contradict each other. If, for example, a customer master data record is sufficient for issuing an invoice and receiving payment, it is fit for that purpose. But if the same record is incomplete or incorrect for customer service, because the data does not fully or correctly describe the who, what and where of the real-world entity having the customer role in that business operation, we have a business problem.
Master data in particular must often be fit for multiple purposes. You can achieve that by ensuring real-world alignment. On the other hand, it might not be profitable or proportionate to strive for perfect real-world alignment when the data only needs to be fit for the intended purpose within the business objective that funds a data quality initiative. Thus, in practice, it is about striking a balance between these two definitions.
In research commissioned by Experian Data Quality in 2013, the top reason for data inaccuracy was found to be human error, with 59% of cases assessed as stemming from that cause. Avoiding or eventually correcting low-quality data caused by human error requires a comprehensive effort with the right mix of remedies covering people, processes and technology.
Other top reasons for data inaccuracy found in the same research are lack of communication between departments (31%) and inadequate data strategy (24%). Solving such issues calls for passionate top-level management involvement.
Importance of data quality
Usually it is not hard to get everyone in a business, including top-level management, to agree that good data quality is good for business. In the current era of digital transformation, the support for focussing on data quality is even stronger than it was before.
However, when it comes to the essential questions about who is responsible for data quality, who must do something about it and who will fund the necessary activities, then the going gets tough.
Data quality resembles human health. Accurately testing how any one element of our diet and exercising may affect our health is fiendishly difficult. In the same way, accurately testing how any one element of our data may affect our business is fiendishly difficult too.
Nevertheless, numerous experiences tell us that bad data quality is not very healthy for business.
The classic examples are:
- In marketing you overspend, and annoy your prospects, by sending the same material more than once to the same person, with the name and address spelled slightly differently each time. The problem here is duplicates within the same database and across several internal and external sources.
- In online sales you cannot present sufficient product data to support a self-service buying decision. The issues here are completeness of product data within your databases and how product data is syndicated between trading partners.
- In supply chain you cannot automate processes based on reliable location information. The challenges here are using the same standards and having the necessary precision within the location data.
- In financial reporting you get different answers for the same question. This is due to inconsistent data, varying freshness of data and unclear data definitions.
On a corporate level, data quality issues have a drastic impact on meeting core business objectives, such as:
- Inability to react in a timely manner to new market opportunities, which hinders profit and growth. Often this is because existing data that was only fit for yesterday’s requirements is not ready to be repurposed.
- Obstacles in implementing cost reduction programs, as the data that must support the ongoing business processes needs too much manual inspection and correction. Automation will only work on complete and consistent data.
- Shortcomings in meeting increasing compliance requirements. These requirements span from privacy and data protection regulations such as GDPR, through health and safety requirements in various industries, to financial restrictions, requirements and guidelines. Better data quality is in most cases a must in order to meet those compliance objectives.
- Difficulties in exploiting predictive analysis on corporate data assets, resulting in more risk than necessary when making both short-term and long-term decisions. These challenges stem from duplication of data, data incompleteness, data inconsistency and data inaccuracy.
How to Improve Data Quality
Improving data quality takes a balanced mix of medicine encompassing people, processes and technology as well as a good portion of top-level management involvement.
Data Quality Dimensions
When improving data quality, the aim will be to measure and improve a range of data quality dimensions.
Uniqueness is the most addressed data quality dimension when it comes to customer master data. Customer master data are often marred by duplicates, meaning two or more database rows describing the same real-world entity. There are several remedies to cure that pain, ranging from intercepting the duplicates at the onboarding point to bulk deduplication of records already stored in one or several databases.
With product master data, uniqueness is a less frequent issue. However, completeness is often a big pain. One reason is that completeness means different requirements for different categories of products.
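As a rough illustration of category-dependent completeness, the sketch below scores a product against the attributes required for its category. The categories, attribute names and the `REQUIRED_ATTRIBUTES` table are hypothetical and only serve the example; a real requirement set would come from your own data standards.

```python
# Hypothetical required attributes per product category (illustrative only).
REQUIRED_ATTRIBUTES = {
    "power tool": ["name", "voltage", "weight_kg"],
    "t-shirt": ["name", "size", "colour"],
}


def completeness(product: dict) -> float:
    """Return the share of required attributes that are filled for a product."""
    required = REQUIRED_ATTRIBUTES.get(product.get("category"), [])
    if not required:
        return 1.0  # no requirements defined for this category
    filled = sum(1 for attr in required if product.get(attr) not in (None, ""))
    return filled / len(required)


drill = {"category": "power tool", "name": "Drill X", "voltage": 18, "weight_kg": None}
shirt = {"category": "t-shirt", "name": "Basic tee", "size": "M", "colour": "blue"}

print(f"drill: {completeness(drill):.0%}")
print(f"shirt: {completeness(shirt):.0%}")
```

The drill lacks one of its three required attributes, while the shirt meets all requirements for its category, even though the two products share only the name attribute.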
When working with location master data consistency can be a challenge. Addressing, so to speak, the different postal address formats around the world is certainly not a walkover.
In the intersection between the location domain and the customer domain the data quality dimension called precision can be hard to manage, as different use cases require different precision for a location, whether it is a postal address, a geographic position or both.
What is relevant to know about your customers and what is relevant to tell about your products are essential questions in the intersection of the customer and product master data domains. This is the data quality dimension of relevance.
Conformity of product data is related to locations. Take unit measurement. In the United States the length of a small thing will be in inches. In most of the rest of the world it will be in centimetres. In the UK you will never know.
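One way to deal with such conformity issues is to store a single standard unit and convert at the edges. The following is a minimal sketch; the function name and the accepted unit spellings are assumptions made for this example.

```python
CM_PER_INCH = 2.54  # exact by international definition


def to_centimetres(value: float, unit: str) -> float:
    """Convert a length to centimetres; reject units this sketch does not know."""
    unit = unit.strip().lower()
    if unit in ("cm", "centimetre", "centimeter"):
        return value
    if unit in ("in", "inch", "inches"):
        return value * CM_PER_INCH
    raise ValueError(f"Unsupported length unit: {unit!r}")


print(to_centimetres(10, "in"))
print(to_centimetres(5, "cm"))
```

Raising an error on an unknown unit, instead of guessing, keeps non-conforming values visible so they can be handled as data quality issues.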
Timeliness, meaning whether the data is available at the time it is needed, is the ever-present data quality dimension across all domains.
Other data quality dimensions to measure and improve are data accuracy, which is about real-world alignment or alignment with a verifiable source; data validity, which is about whether data is within the specified business requirements; and data integrity, which is about whether the relations between entities and attributes are technically consistent.
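To make validity and integrity a little more concrete, here is a small sketch with one check of each kind. The four-digit postal code rule, the field names and the sample records are assumptions chosen for illustration, not rules taken from any particular system.

```python
import re


def invalid_postal_codes(addresses):
    """Validity: ids of addresses whose postal code breaks the assumed rule.

    Assumed business rule for this example: a postal code is exactly four digits.
    """
    return [
        a["id"]
        for a in addresses
        if not re.fullmatch(r"\d{4}", a.get("postal_code") or "")
    ]


def orphaned_orders(orders, customers):
    """Integrity: ids of orders that point at a customer id that does not exist."""
    known_ids = {c["id"] for c in customers}
    return [o["id"] for o in orders if o["customer_id"] not in known_ids]


customers = [{"id": "C1"}, {"id": "C2"}]
orders = [{"id": "O1", "customer_id": "C1"}, {"id": "O2", "customer_id": "C9"}]
addresses = [
    {"id": "A1", "postal_code": "2100"},
    {"id": "A2", "postal_code": "21OO"},  # letter O typed instead of zero
    {"id": "A3", "postal_code": None},
]

print(invalid_postal_codes(addresses))
print(orphaned_orders(orders, customers))
```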
Data Quality Management
In data quality management the goal is to exploit a balanced set of remedies in order to prevent future data quality issues and to cleanse (or ultimately purge) data that does not meet the data quality Key Performance Indicators (KPIs) needed to achieve the business objectives of today and tomorrow.
The data quality KPIs will typically be measured on the core business data assets within data quality dimensions such as data uniqueness, data completeness, data consistency, data conformity, data precision, data relevance, data timeliness, data accuracy, data validity and data integrity.
The data quality KPIs must relate to the KPIs used to measure the business performance in general.
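A data quality KPI is often just a ratio calculated over a data asset. The sketch below, under the assumption that records are plain dictionaries and that a normalized name and email make up the duplicate key, computes a uniqueness KPI and a completeness KPI. The field names and sample data are hypothetical.

```python
def uniqueness_kpi(records, key_fields):
    """Share of distinct keys among all records (1.0 means no duplicates)."""
    if not records:
        return 1.0
    distinct = {
        tuple(str(r.get(f, "")).strip().lower() for f in key_fields)
        for r in records
    }
    return len(distinct) / len(records)


def completeness_kpi(records, field):
    """Share of records where the given field is filled."""
    if not records:
        return 1.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


customers = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100"},
    {"name": "ann lee", "email": "ANN@example.com", "phone": ""},
    {"name": "Bob Ray", "email": "bob@example.com", "phone": None},
    {"name": "Cy Dunn", "email": "cy@example.com", "phone": "555-0101"},
]

print(f"uniqueness:   {uniqueness_kpi(customers, ['name', 'email']):.0%}")
print(f"completeness: {completeness_kpi(customers, 'phone'):.0%}")
```

Exact comparison on normalized keys only catches the simplest duplicates; the data matching techniques described later in this post are needed for the harder ones.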
The remedies used to prevent data quality issues and to carry out eventual data cleansing include these disciplines:
- Data Governance
- Data Profiling
- Data Matching
- Data Quality Reporting
- Master Data Management (MDM)
- Customer Data Integration (CDI)
- Product Information Management (PIM)
- Digital Asset Management (DAM)
Data Governance
A data governance framework must lay out the data policies and data standards that set the bar for which data quality KPIs are needed and which data elements should be addressed. This includes the business rules that must be adhered to and underpinned by data quality measures.
Furthermore, the data governance framework must encompass the organizational structures needed to achieve the required level of data quality. This includes forums such as a data governance committee, and roles such as data owners, data stewards and data custodians, in a balance that makes sense for the given organization.
A business glossary is another valuable outcome from data governance used in data quality management. The business glossary is a primer to establish the metadata used to achieve common data definitions within an organization and eventually in the business ecosystem where the organization operates.
It is essential that the people who are appointed to be responsible for data quality and those who are tasked with preventing data quality issues and data cleansing have a deep understanding of the data at hand.
Data Profiling
Data profiling is a method, often supported by dedicated technology, used to understand the data assets involved in data quality management. These data assets have most often been populated over the years by different people operating under varying business rules and gathered for bespoke business objectives.
In data profiling, the frequency and distribution of data values are counted on relevant structural levels. Data profiling can also be used to discover the keys that relate data entities across different databases and, to the degree that this is not already done, within single databases.
Data profiling can be used to directly measure data integrity and can be used as input to set up the measurement of other data quality dimensions.
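The counting of frequency and distribution can be sketched in a few lines. The example below profiles a single column using only the Python standard library; the `profile_column` helper and the sample country values are made up for this illustration, and dedicated profiling tools go much further.

```python
from collections import Counter


def profile_column(values):
    """Summarise one column: row count, missing count, distinct count, top values."""
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    present = [
        str(v).strip() for v in values if v is not None and str(v).strip() != ""
    ]
    frequencies = Counter(present)
    return {
        "rows": len(values),
        "missing": missing,
        "distinct": len(frequencies),
        "top_values": frequencies.most_common(3),
    }


countries = ["DK", "DK", "dk", "Denmark", "", None, "SE", "DK"]
report = profile_column(countries)
print(report)
```

Even this tiny profile reveals two missing values and three different ways of writing the same country, which is the kind of finding that feeds into completeness and consistency measurements.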
Data Matching
When it comes to real-world alignment, using exact keys in databases is not enough.
The classic example is how we spell the name of a person differently due to misunderstandings, typos, use of nicknames and more. With company names the issues just pile up with funny mnemonics and inclusion of legal forms. When we place these persons and organizations at locations using a postal address, the ways of writing that address vary widely too.
Data matching is a technology based on match codes (for example Soundex), fuzzy logic and, increasingly, machine learning. It is used to determine whether two or more data records describe the same real-world entity (typically a person, a household or an organization).
This method can be used in deduplicating a single database and finding matching entities across several data sources.
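To give a feel for match codes and fuzzy comparison, here is a minimal sketch. It implements the common American Soundex variant and uses `difflib.SequenceMatcher` from the Python standard library as a simple stand-in for fuzzy logic. The `is_probable_match` rule and its 0.75 threshold are arbitrary assumptions for the example, not a recommended configuration.

```python
from difflib import SequenceMatcher

# Soundex digit for each coded consonant; vowels, h, w and y have no code.
_CODES = {}
for _digit, _letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items():
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a single name."""
    letters = [ch for ch in name.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeat after a vowel is kept
    return (encoded + "000")[:4]


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0 and 1 based on matching character blocks."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.75) -> bool:
    """Assumed rule: same match code and a similarity above the threshold."""
    return soundex(a) == soundex(b) and similarity(a, b) >= threshold


print(soundex("Robert"), soundex("Rupert"))
print(is_probable_match("Smith", "Smyth"))
print(is_probable_match("Smith", "Jones"))
```

Match codes are cheap to compute and index, so they are often used to narrow down candidate pairs before a more expensive comparison is applied.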
Often data matching is based on data parsing, where names, addresses and other data elements are split into discrete data elements. For example, an envelope-type address is split into building name, unit, house number, street, postal code, city, state/province and country. This may be supplemented by data standardization, for example using the same value for street, str and st.
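Parsing and standardization can be sketched as follows for a single street line. The regular expression, the `STREET_SYNONYMS` table and the assumption that the house number comes first are simplifications for this example; real postal formats vary far more than this.

```python
import re

# Hypothetical synonym table mapping street type variants to one standard value.
STREET_SYNONYMS = {
    "st": "street", "str": "street", "street": "street",
    "rd": "road", "road": "road",
    "ave": "avenue", "avenue": "avenue",
}

# Assumes the "house number first" layout, e.g. "12 Main St."
STREET_LINE = re.compile(r"^\s*(?P<house_number>\d+\w?)\s+(?P<street>.+?)\s*$")


def parse_street_line(line: str):
    """Split a street line into house number and a standardized street name."""
    match = STREET_LINE.match(line)
    if not match:
        return None
    words = [w.strip(".,").lower() for w in match.group("street").split()]
    if not words:
        return None
    # Only the last word is standardized, so a leading "St" (Saint) is left alone.
    words[-1] = STREET_SYNONYMS.get(words[-1], words[-1])
    return {"house_number": match.group("house_number"), "street": " ".join(words)}


print(parse_street_line("12 Main St."))
print(parse_street_line("12 main street"))
```

After parsing and standardization the two spellings produce identical elements, which makes the subsequent matching step far more reliable.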
Data Quality Reporting
The findings from data profiling can be used as input to measure data quality KPIs based on the data quality dimensions relevant to a given organization. The findings from data matching are especially useful for measuring data uniqueness.
In addition to that it is helpful to operate a data quality issue log, where known data quality issues are documented, and the preventive and data cleansing activities are followed up.
Organizations focussing on data quality find it useful to operate a data quality dashboard highlighting the data quality KPIs and the trend in their measurements as well as the trend in issues going through the data quality issue log.
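A data quality issue log does not need to be sophisticated to be useful. The sketch below models a log entry and a simple trend figure that a dashboard could plot. The `DataQualityIssue` class, its fields and the sample entries are hypothetical and should be adapted to your own governance roles.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DataQualityIssue:
    """One entry in a data quality issue log (fields are illustrative)."""
    issue_id: int
    description: str
    data_owner: str
    data_stewards: List[str]
    impact: str
    raised_on: date
    root_cause: Optional[str] = None
    resolution: Optional[str] = None
    resolved_on: Optional[date] = None


def open_issue_count(log, as_of: date) -> int:
    """Number of issues raised on or before a date and not yet resolved by then."""
    return sum(
        1
        for issue in log
        if issue.raised_on <= as_of
        and (issue.resolved_on is None or issue.resolved_on > as_of)
    )


log = [
    DataQualityIssue(1, "Duplicate customers in CRM", "Head of Sales",
                     ["A. Steward"], "Double mailings", date(2024, 1, 10),
                     resolved_on=date(2024, 2, 1)),
    DataQualityIssue(2, "Missing product weights", "Head of Product",
                     ["B. Steward"], "Freight cost errors", date(2024, 1, 20)),
]

for day in (date(2024, 1, 15), date(2024, 1, 25), date(2024, 2, 15)):
    print(day, open_issue_count(log, day))
```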
Master Data Management (MDM)
Most data quality issues, and the most difficult ones, relate to master data such as party master data (customer roles, supplier roles, employee roles and more), product master data and location master data.
Preventing data quality issues in a sustainable way, without being forced to launch data cleansing activities over and over again, will for most organizations mean that an MDM framework must be in place.
Master Data Management and Data Quality Management (DQM) are tightly coupled disciplines. MDM and DQM will be a part of the same data governance framework and share the same roles, such as data owners, data stewards and data custodians. Data profiling activities will most often be done with master data assets. When doing data matching the results must be kept in master data assets controlling the merged and purged records and the survivorship of data attributes relating to those records.
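Survivorship can be illustrated with a small sketch that merges matched records into one golden record. The rule used here, where the most recently updated non-empty value wins, is only one possible survivorship rule, and the field names and source identifiers are assumptions for the example.

```python
def merge_records(records):
    """Build a golden record from records already matched as the same entity.

    Assumed survivorship rule: for each attribute, the most recently updated
    non-empty value survives. The source ids are kept for traceability.
    """
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for record in newest_first:
        for attr, value in record.items():
            if attr in ("id", "updated"):
                continue
            if value not in (None, "") and attr not in golden:
                golden[attr] = value
    golden["merged_from"] = [r["id"] for r in records]
    return golden


matched = [
    {"id": "CRM-1", "updated": "2023-05-01", "name": "Ann Lee",
     "phone": "555-0100", "email": ""},
    {"id": "ERP-7", "updated": "2024-02-10", "name": "Ann M. Lee",
     "phone": None, "email": "ann@example.com"},
]

golden = merge_records(matched)
print(golden)
```

Keeping the list of merged source ids alongside the surviving attributes is what allows a merge to be audited or undone later.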
Customer Data Integration (CDI)
Customer master data in particular are in many organizations sourced from a range of applications. These include self-service registration sites, Customer Relationship Management (CRM) applications, ERP applications, customer service applications and perhaps many more.
Besides setting up the technical platform for compiling the customer master data from these sources into one source of truth, there is a huge effort in ensuring the data quality of that source of truth. This involves data matching and a sustainable way of ensuring the right data completeness, the best data consistency and adequate data accuracy.
Product Information Management (PIM)
As a manufacturer of goods, you need to align your internal data quality KPIs with those of your distributors and merchants in order to make your products the ones that will be chosen by end customers wherever they have a touchpoint in the supply chain. This must be done by ensuring data completeness and other data quality dimensions within the product data syndication processes.
As a merchant of goods, you will collect product information from many suppliers with each having their data quality KPIs (or not having that yet). Merchants must therefore work closely with their suppliers and strive to have a uniform way of receiving product data in the best quality according to the data quality KPIs at the merchant side.
Digital Asset Management (DAM)
Digital assets are images, text documents, videos and other files often used in conjunction with product data. Through the data quality lens, the challenges for this kind of data are correct and relevant tagging (metadata) as well as the quality of the assets themselves, for example whether a product image clearly shows only the product and not a lot of other things too.
Data Quality Best Practices
In the following we will, based on the reasoning provided above in this post, list a collection of 10 highly important data quality best practices. These are:
- Ensure top-level management involvement. Quite a lot of data quality issues are only solved by having a cross-departmental view.
- Manage data quality activities as a part of a data governance framework. This framework should set the data policies and data standards, the roles needed and provide a business glossary.
- Fill the data owner and data steward roles from the business side of the organization, and fill the data custodian roles from business or IT, wherever it makes most sense.
- Use a business glossary as the foundation for metadata management. Metadata is data about data and metadata management must be used to have common data definitions and link those to current and future business applications.
- Operate a data quality issue log with an entry for each issue with information about the assigned data owner and the involved data steward(s), the impact of the issue, the resolution and the timing of the necessary proceedings.
- For each data quality issue raised, start with a root cause analysis. The data quality problems will only go away if the solution addresses the root cause.
- When finding solutions, strive to implement processes and technology that prevent the issues from occurring as close to the data onboarding point as possible rather than relying on downstream data cleansing.
- Define data quality KPIs that are linked to the general KPIs for business performance. Data quality KPIs, sometimes also called Data Quality Indicators (DQIs), can be related to data quality dimensions such as data uniqueness, data completeness and data consistency.
- Use anecdotes about data quality train wrecks to get awareness around the importance of data quality. However, use fact-based impact and risk analysis to justify the solutions and the needed funding.
- Today a lot of data is already digitalized. Therefore, avoid typing in data where possible. Instead, try to find cost-effective solutions for data onboarding that utilize third-party data sources for publicly available data, for example locations in general and names, addresses and IDs for companies and in some cases individual persons. For product data, utilize second-party data from trading partners where possible.
Data Quality Resources
There are many resources out there where you can learn more about data quality. Below is a list of resources that may be useful when framing a data quality strategy and addressing specific data quality issues:
- Profisee Data Quality – What, Why, How eBook
- Larry P. English is the father of data and information quality management. His thoughts are still available here: https://www.information-management.com/author/larry-english-im30029
- Thomas C. Redman, aka the Data Doc, writes about data quality and data in general in Harvard Business Review. His articles are found here: https://hbr.org/search?term=thomas%20c.%20redman
- David Loshin has written a book titled The Practitioner’s Guide to Data Quality Improvement: http://dataqualitybook.com/?page_id=2
- Gartner, the analyst firm, has a glossary with definitions of data quality terms here: https://www.gartner.com/it-glossary/?s=data+quality
- Massachusetts Institute of Technology (MIT) has a Total Data Quality Management (TDQM) Program: http://web.mit.edu/tdqm/www/index.shtml
- Knowledgent, a part of Accenture, provides a white paper on Data Quality Management here: https://knowledgent.com/whitepaper/building-successful-data-quality-management-program/
- Deloitte has published a case study called data quality driven, customer insights enabled: https://www2.deloitte.com/us/en/pages/deloitte-analytics/articles/data-quality-driven-customer-insights-enabled.html
- An article on bi-survey examines why data quality is essential in Business Intelligence https://bi-survey.com/data-quality-master-data-management
- The University of Leipzig has a page on data matching in big data environments (they call it dedoop) https://dbs.uni-leipzig.de/dedoop
- A Toolbox article by Steve Jones goes through How to Achieve Quality Data in a Big Data context https://it.toolbox.com/blogs/stevejones/how-to-achieve-quality-data-111618
- An Information Week article points to 8 Ways To Ensure Data Quality https://www.informationweek.com/big-data/big-data-analytics/8-ways-to-ensure-data-quality/d/d-id/1322239?image_number=1
- Data Quality Pro is a site, managed by Dylan Jones, with a lot of information about data quality: https://www.dataqualitypro.com/
- Obsessive-Compulsive Data Quality (OCDQ) by Jim Harris is an inspiring blog about data quality and its related disciplines http://www.ocdqblog.com/
- Nicola Askham runs a blog about data governance: https://www.nicolaaskham.com/blog One of the posts in this blog is about what to include in a data quality issue log: https://www.nicolaaskham.com/blog/2018-21-02what-do-you-include-in-data-quality-issue-log
- Henrik Liliendahl has a long-running blog with over 1,000 blog posts about data quality and Master Data Management: https://liliendahl.com/
- A blog called Viqtor Davis Data Craftmanship provides some useful insights on data management: https://www.viqtordavis.com/blog/