Master data management (MDM) begins with harnessing your data through a process called data matching. It’s perhaps one of the most crucial steps in any data transformation initiative, as it requires the organization to make decisions about what data gets saved, how that data is formatted and how the company will prevent duplicate records in the future.
In this article, we’ll explore data matching, discussing the different types of data matching, tools you can use for data matching, the benefits of data matching and how to perform data matching. Let’s get started!
What Is Data Matching?
Data matching is the process by which related records from across an organization are identified, standardized and merged. Because data can be formatted differently across source systems, data matching is critical to ensure that records are correct, up-to-date and standardized across all the different systems that work with that data.
According to the book Data Matching by Peter Christen, data matching is also known as:
- Data linkage
- Record linkage
- Entity resolution
- Object identification
- Field matching
These terms are all similar in that they describe the process by which data from more than one source is identified, cleansed and de-duplicated for consistent use.
What Is Data Matching Software?
Data matching software automates the process of data matching, greatly reducing the labor involved in matching and cleaning data for use. Key features of data matching software include:
- API connections or webhooks to integrate data from data sources
- A matching engine, usually either powered by machine learning (ML) or graph-based matching technology, as in the case of Profisee
- The ability to define survivorship rules to merge records for the greatest level of accuracy
- Workflow automation settings
- Data quality and enrichment capabilities
- A data stewardship UI for business users to monitor data quality and review records flagged for manual review
Data matching software can be offered as a standalone solution but is usually built into master data management (MDM) software solutions instead. MDM tools help companies manage the essential and relatively unchanging data that runs their business — master data.
What Is Customer Data Matching?
Customer data matching is the process whereby organizations pull customer data from each of their software tools that store information about customers to build golden records of customer data. Customer data matching ensures that the organization works from the most up-to-date and accurate master customer data across its data estate.
Example customer data that may get updated during the matching process includes:
- Contact name
- Customer address
- Business name
- Contact phone number
- Contact email address
Because these fields are manually entered by employees or by customers themselves, companies might pull different entries for any of these data points for a single customer.
Data matching pulls this information from the company’s CRM, marketing automation tool and ERP, for example, into a central location. There, the matching software can correct typos, fill empty fields, standardize address formatting and produce a single, golden record for use across the company.
What Types of Data Matching Are There?
While all data matching processes have the same intended outcome, how they get there varies by the type(s) of data matching they provide. Data matching processes can be split into two major types: deterministic and probabilistic.
Deterministic Matching
Also known as exact matching or deterministic linking, this is when the matching tool (or human) doing the matching will combine data sets based on fields that contain identical sets of characters.
Deterministic matching works when the data includes unique IDs such as Social Security numbers or customer ID numbers. For example, if every software platform the company uses except its shipping software includes a standard, unique customer identification number, customer records can be matched, cleansed and de-duplicated across all source systems except the shipping software.
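As a rough illustration, exact matching on a shared ID can be sketched in a few lines of Python. The field name `customer_id`, the sample records and the function name are all hypothetical. A real matching tool would also handle formatting differences in the key itself.

```python
def deterministic_match(crm_records, erp_records, key="customer_id"):
    """Pair records from two systems whose key field is identical."""
    # Index one system by the key, skipping records that have no ID.
    erp_by_key = {r[key]: r for r in erp_records if r.get(key)}
    matched, unmatched = [], []
    for record in crm_records:
        other = erp_by_key.get(record.get(key))
        if other is not None:
            matched.append((record, other))
        else:
            unmatched.append(record)
    return matched, unmatched


# Hypothetical sample data from two source systems.
crm = [
    {"customer_id": "C-1001", "name": "Tony Bologna"},
    {"customer_id": "C-1002", "name": "Cassandra Bologna"},
    {"customer_id": None, "name": "Anthony B."},  # no ID, so no exact match
]
erp = [
    {"customer_id": "C-1001", "name": "Tony B."},
    {"customer_id": "C-1003", "name": "Someone Else"},
]

matched, unmatched = deterministic_match(crm, erp)
```

Records without a usable ID fall through to `unmatched`, which is where probabilistic matching typically takes over.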
Probabilistic Matching
Also known as fuzzy matching or probabilistic linking, this matching process links records whose fields do not match exactly but whose similarity scores above a predetermined probabilistic matching threshold. Fuzzy data matching works best on data sets with several different data points, which give the tool more evidence for calculating the probability of a match.
These tools may be set to match key identifying attributes over several data points, a percentage threshold of matching characters in a field or other custom parameters. The probability of a match is then determined based on the tool’s confidence and expressed as a number between 0 and 100.
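To make the idea of a 0 to 100 score concrete, here is a minimal sketch that uses Python's standard `difflib.SequenceMatcher` to compare several fields and combine them into one weighted score. The field weights and sample records are assumptions for illustration only. Commercial matching engines use more sophisticated similarity measures than this character-level ratio.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; a real tool would tune these per data set.
FIELD_WEIGHTS = {"name": 0.3, "address": 0.3, "email": 0.4}


def field_similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(record_a, record_b, weights=FIELD_WEIGHTS):
    """Weighted similarity across several fields, scaled to 0-100."""
    total = sum(
        weight * field_similarity(record_a[field], record_b[field])
        for field, weight in weights.items()
    )
    return round(100 * total / sum(weights.values()), 1)


record_a = {
    "name": "Tony Bologna",
    "address": "56 Lincoln St. Billings MT 00011",
    "email": "Tony.boloney@gmail.com",
}
record_b = {
    "name": "Tony B.",
    "address": "56 Lincoln St Billings MT 00011",
    "email": "tony.boloney@gmail.com",
}
record_c = {
    "name": "Cassandra Bologna",
    "address": "123 123rd St. Billings MT 00011",
    "email": "Sandra.t.bologna@gmail.com",
}

score_ab = match_score(record_a, record_b)  # likely the same customer
score_ac = match_score(record_a, record_c)  # likely different customers
```

Comparing `score_ab` and `score_ac` against a chosen threshold is what decides whether each pair is treated as a match.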
Do You Need a Data Matching Tool?
Organizations can quickly acquire vast quantities of data within each source system, and those numbers increase with every lead, sale, appointment or shipment that is made. Importing, sorting and manually matching every record from each system and compiling that data into a single golden record would require thousands of hours, and the process would be prone to the same mistakes and typos as the original data.
Data matching tools automate the process of consolidating and cleansing data according to the company’s data needs. These tools can scan for matches across millions of entries from every software database at the company, match the records that meet requirements and flag exceptions for review. This allows data analysts to manage data by exception, rather than by individual record.
Data Matching Examples
Depending on the industry you serve and the products you sell, your master data aspirations may differ. Location and customer data are two of the most common golden records that companies will keep.
Location Data Matching
Location data matching is used to standardize and complete records that describe locations. This can include locations of field offices, distributors, customers or sales prospects. While location data is slow to change, it can require quite a bit of standardization, especially if the data is pulled from spreadsheets or other manual entry entities.
In the example below, the Home and Primary locations are the same, although there are several differences in the ways the address and phone numbers are formatted. Data professionals will decide the best format for each of these fields when creating data governance policies, which will then be enforced during matching and standardization.
Name | Address | Phone | Fax | Zip Code |
---|---|---|---|---|
Home | 123 123rd street | (555) 987-5555 | (555) 987-5556 | 00011 |
Primary | 123 123rd St | 555.987.5555 | 555.987.5556 | 00011 |
Second | 56 Lincoln | (555) 678.5555 | (555) 678.5556 | 00012 |
Riverside | 476 Riverside | 555-444-5555 | 555-444-5556 | 01113 |
Another decision that the data professionals may want to consider is the naming conventions used for each of their locations. While “Home” and “Second” may work for a company with only two locations, naming the location by the street name may work better for an expanding enterprise. The name could be extracted from the address field and filled for all locations.
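The standardization described above can be sketched as a pair of small formatting functions. The target formats here ("St" for street, NNN-NNN-NNNN for phone numbers) are assumed house rules chosen for illustration. Your data governance policies would define the actual formats.

```python
import re

# Hypothetical house style: abbreviate street types to "St".
STREET_ABBREVIATIONS = {"street": "St", "st": "St", "st.": "St"}


def standardize_phone(raw):
    """Keep only the digits and format 10-digit numbers as NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        return raw  # leave unusual values untouched for manual review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def standardize_address(raw):
    """Collapse extra whitespace and apply the street-type abbreviations."""
    words = raw.split()
    return " ".join(STREET_ABBREVIATIONS.get(w.lower(), w) for w in words)


def standardize_location(record):
    """Return a copy of the record with address and phone standardized."""
    return {
        **record,
        "address": standardize_address(record["address"]),
        "phone": standardize_phone(record["phone"]),
    }


home = {"name": "Home", "address": "123 123rd street", "phone": "(555) 987-5555"}
primary = {"name": "Primary", "address": "123 123rd St", "phone": "555.987.5555"}

clean_home = standardize_location(home)
clean_primary = standardize_location(primary)
```

After standardization, the Home and Primary rows from the table carry identical address and phone values, so even an exact comparison would identify them as the same location.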
Customer Data Matching
Customer data matching is useful for getting a unified view of your customers, but it can be complicated by the use of PO boxes, differences between billing and shipping addresses, multiple email addresses and differences between how customer names are entered. Finding a single field that stays consistent across customer records from different parts of an enterprise is unlikely.
Once the customer data is matched and standardized, however, a unique customer ID can be assigned to each customer, and potential matches can be surfaced by the customer MDM software to prevent mismatches based on misspellings or abbreviations.
Name | Billing | Shipping | Email |
---|---|---|---|
Tony B. | 56 Lincoln St. Billings MT 00011 | 56 Lincoln St. Billings MT 00011 | Tony.boloney@gmail.com |
Tony Bologna | PO Box 8776 Nashville TN 37220 | 56 Lincoln St. Billings MT 00011 | Tony.boloney@gmail.com |
Anthony B. | PO Box 8776 Nashville TN 37220 | PO Box 8776 Nashville TN 37220 | Tony.boloney@gmail.com |
Cassandra Bologna | 123 123rd St. Billings MT 00011 | 123 123rd St. Billings MT 00011 | Sandra.t.bologna@gmail.com |
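In the table above, the email address happens to be the one field that stays consistent across the three Tony records. A minimal sketch of grouping on a normalized email and assigning an ID to each group might look like the following. Treating email alone as the match key is an assumption that suits this sample only, and the `CUST-0001` ID format is hypothetical.

```python
from collections import defaultdict


def group_by_email(records):
    """Group records that share the same email, ignoring case and spaces."""
    groups = defaultdict(list)
    for record in records:
        groups[record["email"].strip().lower()].append(record)
    return dict(groups)


def assign_customer_ids(groups, prefix="CUST"):
    """Give each group a sequential ID in a hypothetical CUST-0001 format."""
    return {
        f"{prefix}-{n:04d}": members
        for n, members in enumerate(groups.values(), start=1)
    }


records = [
    {"name": "Tony B.", "email": "Tony.boloney@gmail.com"},
    {"name": "Tony Bologna", "email": "Tony.boloney@gmail.com"},
    {"name": "Anthony B.", "email": "Tony.boloney@gmail.com"},
    {"name": "Cassandra Bologna", "email": "Sandra.t.bologna@gmail.com"},
]

customers = assign_customer_ids(group_by_email(records))
```

Here the three Tony variants end up under one ID and Cassandra under another. In practice, a matching tool would combine several fields rather than trust a single one.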
Data Matching Benefits
As long as you can use data matching software to assist you in the process, data matching can pay dividends in reducing costs and improving the quality of data analysis across the enterprise. Here are just some of the benefits of data matching.
Golden Record Data
Data matching is a crucial step in producing golden records — master data that companies rely on to run their businesses efficiently. A golden record of customer, product, location and employee data reduces company mistakes, increases the accuracy of forecasts and provides a baseline for innovative data uses like generative AI.
Reduced Database Size
De-duplication and elimination of incorrect, incomplete and inaccurate master data drastically reduce the volume of data to be maintained. A reduced database will speed compute processes on the database and save you money in the long term.
Reduced TCO for Data Storage
Cloud data storage has been a true game-changer for many companies that previously did not have the funds or physical space to manage their own data center. While cloud data storage providers offer flexibility and scalability for database storage, the costs can get out of hand quickly. But data matching and cleansing will reduce your overall database size, meaning the amount you pay for storing that data will also go down.
Faster and More Accurate Data Analysis
Accurate data analysis relies on accurate databases, and fast data analysis relies upon low ingestion times. When golden record data is accurate and de-duplicated, the analysis teams and tools don’t have to filter out exceptions due to inaccurate data or second-guess the analysis provided by business intelligence and forecasting tools.
Regulatory Compliance
Companies that employ data matching solutions have an easier time complying with federal, regional and local regulations because they have fewer records to parse for each data request. And because the data footprint shrinks with proper data cleansing and policies to prevent re-duplication, protecting and accessing individual records takes less time.
Decreased Security and Fraud Risk
Smaller databases are easier to maintain and secure, while duplicate or incomplete customer records can be a point of entry for bad actors. When your company maintains clean and accurate records, those records are easier to monitor for inconsistencies or outright breaches. Data matching solutions can also give time back to analysts to follow up on exceptions or unexpected outcomes from the data.
Data Matching Challenges
Every digital initiative brings challenges, but the right tools and processes can reduce the effects of challenges like incomplete records, exceptions and ongoing data cleansing.
Incomplete Records
Data matching requires correct or complete information in at least some of the fields to meet fuzzy matching thresholds. However, if incomplete or incorrect data exists, the organization may not have enough information to update the record.
Data matching tools that also provide data enrichment services through publicly accessible databases can lower the number of incomplete fields across the dataset.
Exceptions
Exceptions occur in data matching when a field or group of fields does not fit into one of the predefined categories. Exceptions are common with exact data matching, as any misspelling, typo or incomplete field could be considered an exception.
In probabilistic data matching, any field that does not meet the criteria or correctness threshold is marked as an exception, which must then be reviewed by a data analyst. Whether the analyst can surmise what information is missing or incorrect determines whether the field is corrected or the entry is discarded.
Re-Duplication of Records in the Future
Companies must devise a process by which data from across the enterprise will be matched in the future to prevent the reintroduction of duplicate records — a process often referred to as “lookup before create.” The original data cleansing and matching process takes time and effort to complete, but data entry protocols and software with access to centralized and cleansed master data as reference materials prevent the need for additional periodic data matching rounds.
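A "lookup before create" check can be sketched as a function that searches the existing master records for likely duplicates before a new record is saved. The matching rule used here (same email, or a name similarity of at least 0.85 from Python's `difflib`) and the sample data are assumptions for illustration, not a prescribed rule set.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup_before_create(new_record, master_records, threshold=0.85):
    """Return likely duplicates of new_record.

    An empty list means no candidate was found and the record can be created.
    """
    candidates = []
    for existing in master_records:
        same_email = new_record["email"].lower() == existing["email"].lower()
        similar_name = similarity(new_record["name"], existing["name"]) >= threshold
        if same_email or similar_name:
            candidates.append(existing)
    return candidates


# Hypothetical master data that has already been matched and cleansed.
master = [
    {"id": "CUST-0001", "name": "Tony Bologna", "email": "tony.boloney@gmail.com"},
    {"id": "CUST-0002", "name": "Cassandra Bologna", "email": "sandra.t.bologna@gmail.com"},
]

possible_duplicate = {"name": "T. Bologna", "email": "Tony.Boloney@gmail.com"}
brand_new = {"name": "Maria Gomez", "email": "maria@example.com"}

duplicates_found = lookup_before_create(possible_duplicate, master)
nothing_found = lookup_before_create(brand_new, master)
```

A data entry form could show `duplicates_found` to the user and ask whether to update the existing record instead of creating a new one.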
Data Matching Use Cases
The power of data matching lies in the company’s ability to put previously chaotic data to use to improve productivity and outcomes for customers. These three companies used Profisee MDM to create golden records.
Healthcare
Mass General Brigham, a Massachusetts-based healthcare organization, wanted to build a provider search system that would help customers find in-network healthcare providers based on several factors including specialty, location and insurance coverage. Mass General Brigham needed to match provider data from across several source systems, including spreadsheets, and cleanse the data of misspellings and abbreviations.
Mass General Brigham uses Profisee to match data across its systems and create golden records that are used to feed their provider search system.
Retailers
Rheem Manufacturing needed to combine the company’s air (HVAC and air conditioning) and water (water heaters) divisions. The move would consolidate customer, sales and supplier data across different global lines of business and expand upon the traditional specializations of the contract installers, allowing for cross- and up-sell opportunities that were not previously available.
Rheem uses Profisee’s intelligent fuzzy-matching features to consolidate and combine customer data from distributors and installers.
Franchises and Multi-Location Corporations
The YMCA of the USA is a non-profit organization with 763 associations, 2,700 branch locations, 20 million members and over 30,000 staff members. With so many locations and unique records, the YMCA needed to standardize their staff records to improve staff management and analysis that would improve operational efficiency.
The YMCA chose Profisee’s cloud MDM tool to master its staff member data, matching data across 29 different source systems. With the lessons they learned in building these golden records, YMCA of the USA will tackle branch and member data next.
Data Matching Techniques and Steps
Data matching can be done manually by combining and updating spreadsheets. Modern spreadsheet software eases the burden a bit through formatting rules and formulas, but the best option for most organizations will be to use an MDM tool with data stewardship functionality to guide business users through these steps and perform the actions, alerting the data team when there are records requiring manual review.
1. Data Integration
During this phase, data is integrated from source systems via API, webhook or file upload. This step is essential for breaking down data silos, as it entails collecting data from disconnected data sources and consolidating it into a single repository — the MDM tool.
2. Matching
This step can be split into two parts: defining the matching thresholds, and applying them so that records are either matched automatically or flagged for manual review by data analysts. The matching threshold should strike a delicate balance between reducing the overall number of duplicate entries and wrongly merging entries that should remain separate. When in doubt, err on the side of keeping duplicates or sending entries for review rather than merging them.
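One common way to apply thresholds is to use two cut-off values: pairs scoring above the higher one are matched automatically, pairs between the two are sent for review, and the rest are left separate. The sketch below assumes scores on a 0 to 100 scale. The specific values of 90 and 70 and the third "no_match" bucket are illustrative assumptions, not recommendations.

```python
# Hypothetical thresholds on a 0-100 match score.
AUTO_MATCH_THRESHOLD = 90
REVIEW_THRESHOLD = 70


def route_pair(score, auto_match=AUTO_MATCH_THRESHOLD, review=REVIEW_THRESHOLD):
    """Decide what happens to a candidate pair based on its match score."""
    if score >= auto_match:
        return "match"
    if score >= review:
        return "review"
    return "no_match"


def route_all(scored_pairs):
    """Sort (pair_id, score) tuples into match, review and no_match buckets."""
    buckets = {"match": [], "review": [], "no_match": []}
    for pair_id, score in scored_pairs:
        buckets[route_pair(score)].append(pair_id)
    return buckets


# Hypothetical scores produced by an earlier scoring step.
scored_pairs = [("A-B", 96.5), ("A-C", 74.0), ("A-D", 31.2), ("B-C", 90.0)]
buckets = route_all(scored_pairs)
```

Raising `AUTO_MATCH_THRESHOLD` sends more pairs to review and lowers the risk of merging records that should stay separate.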
3. De-Duplication, Standardization and Enrichment
Following the matching phase, you can begin the process of merging and de-duplicating records according to your organization’s survivorship rules. In Profisee, for instance, data stewards review potentially matching records and select “winners.” The records are then merged according to pre-defined survivorship rules that determine which values Profisee should retain when it encounters conflicting or duplicate information. De-duplicated records are then standardized according to data governance policies and optionally enriched with third-party data and address verification services, such as Melissa, Google Maps or Dun & Bradstreet.
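Survivorship rules can be thought of as one small strategy per field that picks the winning value from a group of duplicates. The following is a generic sketch, not Profisee's API. The rule names, the `updated` field and the sample records are all hypothetical.

```python
from datetime import date


def most_recent(records, field):
    """Value of `field` from the most recently updated record that has one."""
    filled = [r for r in records if r.get(field)]
    return max(filled, key=lambda r: r["updated"])[field] if filled else None


def longest(records, field):
    """Longest non-empty value of `field`, a rough proxy for most complete."""
    values = [r[field] for r in records if r.get(field)]
    return max(values, key=len) if values else None


# Hypothetical survivorship rules: one strategy per field.
SURVIVORSHIP_RULES = {"name": longest, "email": most_recent, "phone": most_recent}


def merge_records(records, rules=SURVIVORSHIP_RULES):
    """Build one merged record by applying each field's survivorship rule."""
    return {field: rule(records, field) for field, rule in rules.items()}


duplicates = [
    {"name": "Tony B.", "email": "tony.boloney@gmail.com", "phone": "",
     "updated": date(2023, 1, 5)},
    {"name": "Tony Bologna", "email": "tony.boloney@gmail.com",
     "phone": "555-678-5555", "updated": date(2022, 6, 1)},
    {"name": "Anthony B.", "email": "", "phone": "", "updated": date(2024, 3, 9)},
]

golden = merge_records(duplicates)
```

In this sample, the merged record keeps the fullest name and the only phone number on file, even though they come from the oldest record, because empty values never win.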
4. Golden Record Publication
Finally, with newly created golden records, data can be published from the MDM tool back to the source systems so that every system references the most accurate, complete and up-to-date version of each record. Depending on your organization’s data architecture, you may alternatively publish golden records to a data lake or warehouse, where they can then be accessed by downstream systems like business intelligence, HRIS or ERP.
Make Data Matching Easy With Profisee MDM
Profisee’s MDM platform comes with out-of-the-box features for data matching, survivorship, de-duplication and standardization. Featuring one of the most advanced and flexible matching engines on the market, Profisee lets you automate data matching while still providing an intuitive, user-friendly interface for business users to manually review potentially matching records and monitor data quality.
If you’re interested in learning more about how Profisee’s in-memory, graph-based matching engine can help your organization simplify data matching and improve business outcomes, check out our datasheet below for a more detailed look at how Profisee handles data matching, survivorship and many other essential MDM functions.
Datasheet: Golden Record Management
Forrest Brown
Forrest Brown is the Content Marketing Manager at Profisee and has been writing about B2B tech for eight years, spanning software categories like project management, enterprise resource planning (ERP) and now master data management (MDM). When he's not at work, Forrest enjoys playing music, writing and exploring the Atlanta food scene.