As data becomes a core part of every business operation, the quality of the data gathered, stored and consumed during business processes will determine the success achieved in doing business today and tomorrow.
You can regard data as the foundation of a hierarchy, with data at the bottom level. On top of data you have information, being data in context. Further up we have knowledge, seen as actionable information, and at the top level wisdom, as applied knowledge.
If you have bad data quality, you will not have good information quality. Without good information quality, you will lack actionable knowledge in business operations, and you will either fail to apply that knowledge or apply it wrongly, with risky business outcomes as the result.
There are many definitions of data quality. The two predominant ones are:

- Data quality as the fitness of data for its intended purpose of use
- Data quality as the degree to which data correctly describes the real-world entity it represents
These two definitions may contradict each other. If, for example, a customer master data record is fit for issuing an invoice and receiving a payment, it is fit for that purpose. But if the same customer master data record is incomplete or incorrect for doing customer service, because the data incompletely or incorrectly describes the who, what and where of the real-world entity having the customer role in that business operation, we have a business problem.
Not least, master data must often be fit for multiple purposes. You can achieve that by ensuring real-world alignment. On the other hand, it might not be profitable and proportionate to strive for perfect real-world alignment in order to have data fit for the intended purpose of use within the business objective where a data quality initiative is funded. Thus, in practice, it is about striking a balance between these two definitions.
In research commissioned by Experian Data Quality in 2013, the top reason for data inaccuracy was found to be human error, assessed to be the cause in 59% of cases. Avoiding, or subsequently correcting, low-quality data caused by human error requires a comprehensive effort with the right mix of remedies spanning people, processes and technology.
Other top reasons for data inaccuracy found in that research are lack of communication between departments (31%) and inadequate data strategy (24%). Solving such issues calls for passionate top-level management involvement.
Usually it is not hard to get everyone in a business, including the top-level management, to agree that good data quality is good for business. In the current era of digital transformation, the support for focussing on data quality is even better than it was before.
However, when it comes to the essential questions about who is responsible for data quality, who must do something about it and who will fund the necessary activities, then the going gets tough.
Data quality resembles human health. Accurately testing how any one element of our diet and exercise may affect our health is fiendishly difficult. In the same way, accurately testing how any one element of our data may affect our business is fiendishly difficult too.
Nevertheless, numerous experiences tell us that bad data quality is not very healthy for business.
The classic examples are:
On a corporate level, data quality issues have a drastic impact on meeting core business objectives, such as:
Improving data quality takes a balanced mix of medicine encompassing people, processes and technology as well as a good portion of top-level management involvement.
When improving data quality, the aim will be to measure and improve a range of data quality dimensions.
Uniqueness is the most addressed data quality dimension when it comes to customer master data. Customer master data are often marred by duplicates, meaning two or more database rows describing the same real-world entity. There are several remedies around to cure that pain, ranging from intercepting duplicates at the point of onboarding to bulk deduplication of records already stored in one or several databases.
With product master data, uniqueness is a less frequent issue. However, completeness is often a big pain. One reason is that completeness means different requirements for different categories of products.
When working with location master data consistency can be a challenge. Addressing, so to speak, the different postal address formats around the world is certainly not a walkover.
In the intersection between the location domain and the customer domain, the data quality dimension called precision can be hard to manage, as different use cases require different precision for a location, whether that is a postal address and/or a geographic position.
What is relevant to know about your customers and what is relevant to tell about your products are essential questions in the intersection of the customer and product master data domains.
Conformity of product data is related to locations. Take units of measurement. In the United States the length of a small thing will be in inches. In most of the rest of the world it will be in centimetres. In the UK you will never know.
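As a sketch of how such conformity can be enforced programmatically, values can be normalized to a canonical unit before they enter the product data store. The unit table below is an assumption covering only lengths, for illustration:

```python
# Minimal sketch: normalize length measurements to a canonical unit (cm)
# so product data conforms regardless of the source market's convention.

CM_PER_UNIT = {
    "cm": 1.0,
    "mm": 0.1,
    "m": 100.0,
    "in": 2.54,   # typical in United States sources
    "ft": 30.48,
}

def to_centimetres(value: float, unit: str) -> float:
    """Convert a length to centimetres; reject unknown units."""
    try:
        return value * CM_PER_UNIT[unit.lower()]
    except KeyError:
        raise ValueError(f"Unknown unit of measurement: {unit!r}")

print(to_centimetres(10, "in"))  # 25.4
```

Rejecting unknown units outright, rather than passing values through, is what turns this from a conversion helper into a conformity check.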
Timeliness, meaning whether the data is available at the time it is needed, is the everlasting data quality dimension across all domains.
Other data quality dimensions to measure and improve are data accuracy, being about alignment with the real world or with a verifiable source; data validity, being about whether data is within the specified business requirements; and data integrity, being about whether the relations between entities and attributes are technically consistent.
In data quality management the goal is to exploit a balanced set of remedies in order to prevent future data quality issues and to cleanse (or ultimately purge) data that does not meet the data quality Key Performance Indicators (KPIs) needed to achieve the business objectives of today and tomorrow.
The data quality KPIs will typically be measured on the core business data assets within data quality dimensions such as data uniqueness, data completeness, data consistency, data conformity, data precision, data relevance, data timeliness, data accuracy, data validity and data integrity.
The data quality KPIs must relate to the KPIs used to measure the business performance in general.
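A simple sketch of how such KPIs can be computed follows; the sample records, field names and the choice of the completeness and uniqueness dimensions are assumptions for illustration:

```python
# Hypothetical sketch: compute simple data quality KPIs over customer records.

records = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.com"},
    {"id": 2, "name": "Bob Roe", "email": None},              # missing email
    {"id": 3, "name": "Ann Lee", "email": "ann@example.com"}, # duplicate
]

def completeness(rows, field):
    """Share of rows where the given field is populated."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

def uniqueness(rows, fields):
    """Share of rows that are unique on the given field combination."""
    seen = {tuple(r.get(f) for f in fields) for r in rows}
    return len(seen) / len(rows)

print(f"email completeness: {completeness(records, 'email'):.0%}")   # 67%
print(f"uniqueness: {uniqueness(records, ('name', 'email')):.0%}")   # 67%
```

In practice these percentages would be tracked over time against thresholds agreed in the data governance framework, so that a KPI breach triggers preventive or cleansing work.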
The remedies used to prevent data quality issues, and to cleanse data where needed, include these disciplines:

- Data governance
- Data profiling
- Data matching
- Data quality reporting
- Master Data Management (MDM)
- Customer Data Integration (CDI)
- Product Information Management (PIM)
- Digital Asset Management (DAM)
A data governance framework must lay out the data policies and data standards that set the bar for which data quality KPIs are needed and which data elements should be addressed. This includes the business rules that must be adhered to, underpinned by data quality measures.
Furthermore, the data governance framework must encompass the organizational structures needed to achieve the required level of data quality. This includes forums such as a data governance committee, and roles such as data owners, data stewards and data custodians, in balance with what makes sense in a given organization.
A business glossary is another valuable outcome of data governance used in data quality management. The business glossary is the starting point for establishing the metadata used to achieve common data definitions within an organization, and eventually within the business ecosystem where the organization operates.
It is essential that the people who are appointed to be responsible for data quality and those who are tasked with preventing data quality issues and data cleansing have a deep understanding of the data at hand.
Data profiling is a method, often supported by dedicated technology, used to understand the data assets involved in data quality management. These data assets have most often been populated over the years by different people operating under varying business rules and gathered for bespoke business objectives.
In data profiling, the frequency and distribution of data values are counted at relevant structural levels. Data profiling can also be used to discover the keys that relate data entities across different databases, and within single databases to the degree that this has not already been done.
Data profiling can be used to directly measure data integrity and can be used as input to set up the measurement of other data quality dimensions.
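As a minimal sketch of the idea, the snippet below profiles a column's fill rate, distinct values and value distribution using only the Python standard library; the sample rows and column names are assumptions:

```python
# Minimal data profiling sketch: frequency and distribution of values
# per column, plus fill rate, using only the standard library.
from collections import Counter

rows = [
    {"country": "US", "postal_code": "90210"},
    {"country": "US", "postal_code": None},
    {"country": "DK", "postal_code": "8000"},
]

def profile_column(rows, column):
    """Summarize one column: how full it is and what values dominate."""
    values = [r.get(column) for r in rows]
    return {
        "fill_rate": sum(v is not None for v in values) / len(values),
        "distinct": len(set(values) - {None}),
        "top_values": Counter(v for v in values if v is not None).most_common(3),
    }

print(profile_column(rows, "country"))
# {'fill_rate': 1.0, 'distinct': 2, 'top_values': [('US', 2), ('DK', 1)]}
```

Real profiling tools add pattern analysis, cross-column dependencies and key discovery on top of exactly this kind of per-column counting.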
When it comes to real-world alignment using exact keys in databases is not enough.
The classic example is how we spell the name of a person differently due to misunderstandings, typos, use of nicknames and more. With company names the issues just pile up, with funny mnemonics and the inclusion of legal forms. When we place these persons and organizations at locations using a postal address, the ways of writing that address have numerous variations too.
Data matching is a technology based on match codes (for example soundex), fuzzy logic and, increasingly, machine learning, used to determine whether two or more data records describe the same real-world entity (typically a person, a household or an organization).
This method can be used in deduplicating a single database and finding matching entities across several data sources.
Often data matching is based on data parsing, where names, addresses and other data elements are split into discrete data elements; for example, an envelope-type address is split into building name, unit, house number, street, postal code, city, state/province and country. This may be supplemented by data standardization, for example using the same value for street, str and st.
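To illustrate, here is a minimal sketch of two such building blocks: a classic American Soundex match code and a simple standardization map. The sample names and the abbreviation list are assumptions:

```python
# Sketch of data matching building blocks: a soundex match code
# plus simple standardization of street abbreviations.

def soundex(name: str) -> str:
    """Classic American Soundex match code, e.g. 'Robert' -> 'R163'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = "".join(c for c in name.lower() if c.isalpha())
    if not letters:
        return ""
    encoded = [codes.get(letters[0], "")]
    for c in letters[1:]:
        d = codes.get(c, "")
        if d and d != encoded[-1]:
            encoded.append(d)
        elif not d and c not in "hw":
            encoded.append("")  # vowels separate repeated codes; h/w do not
    digits = "".join(x for x in encoded[1:] if x)
    return (letters[0].upper() + digits + "000")[:4]

STREET_SYNONYMS = {"str": "street", "st": "street", "st.": "street"}

def standardize(token: str) -> str:
    """Map known abbreviations to one canonical value."""
    return STREET_SYNONYMS.get(token.lower(), token.lower())

# Two differently spelled names can yield the same match code:
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Match codes like this are deliberately coarse: they over-match so that candidate pairs can then be scored more precisely with fuzzy comparison or a trained model.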
The findings from data profiling can be used as input to measure data quality KPIs based on the data quality dimensions relevant to a given organization. The findings from data matching are especially useful for measuring data uniqueness.
In addition, it is helpful to operate a data quality issue log, where known data quality issues are documented and the preventive and data cleansing activities are followed up.
Organizations focussing on data quality find it useful to operate a data quality dashboard highlighting the data quality KPIs and the trend in their measurements as well as the trend in issues going through the data quality issue log.
Most data quality issues, and the most difficult ones, relate to master data: party master data (covering customer roles, supplier roles, employee roles and more), product master data and location master data.
Preventing data quality issues in a sustainable way, rather than being forced to launch data cleansing activities over and over again, will for most organizations mean that a Master Data Management (MDM) framework must be in place.
MDM and Data Quality Management (DQM) are tightly coupled disciplines. They will be part of the same data governance framework and share the same roles, such as data owners, data stewards and data custodians. Data profiling activities will most often be done on master data assets. When doing data matching, the results must be kept in master data assets that control the merged and purged records and the survivorship of the data attributes relating to those records.
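As an illustration of survivorship, the sketch below merges matched records into one golden record, letting the most recently updated non-empty value win. The field names, sample values and the recency rule are assumptions; real MDM platforms support many more survivorship rules:

```python
# Hypothetical survivorship sketch: merge matched records into one
# golden record, letting the most recently updated non-empty value win.
from datetime import date

matched = [
    {"updated": date(2021, 3, 1), "name": "Ann Lee",    "phone": None},
    {"updated": date(2022, 5, 9), "name": "Ann J. Lee", "phone": "555-0100"},
]

def survive(records, fields):
    """Build a golden record from records matched to one real-world entity."""
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for f in fields:
            if rec.get(f) not in (None, ""):
                golden[f] = rec[f]  # later non-empty values overwrite earlier
    return golden

print(survive(matched, ["name", "phone"]))
# {'name': 'Ann J. Lee', 'phone': '555-0100'}
```

Note that the older record still contributes where the newer one is empty, which is the point of attribute-level (rather than record-level) survivorship.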
Not least, customer master data are in many organizations sourced from a range of applications: self-service registration sites, Customer Relationship Management (CRM) applications, ERP applications, customer service applications and perhaps many more.
Besides setting up the technical platform for compiling the customer master data from these sources into one source of truth, there is a huge effort in ensuring the data quality of that source of truth. This involves data matching and a sustainable way of ensuring the right data completeness, the best data consistency and adequate data accuracy.
As a manufacturer of goods, you need to align your internal data quality KPIs with those of your distributors and merchants in order to make your products the ones that end customers choose, wherever they have a touchpoint in the supply chain. This must be done by ensuring data completeness and the other data quality dimensions within the product data syndication processes.
As a merchant of goods, you will collect product information from many suppliers, each with their own data quality KPIs (or none yet). Merchants must therefore work closely with their suppliers and strive for a uniform way of receiving product data in the best quality according to the data quality KPIs at the merchant side.
Digital assets are images, text documents, videos and other files often used in conjunction with product data. Through the data quality lens, the challenges for this kind of data revolve around correct and relevant tagging (metadata) as well as the quality of the assets themselves, for example whether a product image clearly shows the product and not a lot of other things too.
In the following we will, based on the reasoning provided above in this post, list 10 highly important data quality best practices. These are:
There are many resources out there where you can learn more about data quality. Below is a list of some resources that may be useful when framing a data quality strategy and addressing specific data quality issues: