What is Fuzzy Matching and How Can It Clean Up My Bad Data?

In today’s competitive, ever-changing business landscape, providing significant value for an organization is becoming increasingly difficult.  

Today, over 300 million terabytes of data are created each day, presenting organizations with both powerful raw information and opportunity. But if companies are unable to make sense of that data — and, more importantly, know the insights derived from it can be trusted — bad data can become a brick wall standing in the way of their most strategic initiatives. 

According to a recent Gartner article, 84% of customer service and service support leaders cited customer data and analytics as “very or extremely important” for achieving their organizational goals in 2023. That is a staggering number of organizations that depend on their data quality to grow their business. You can’t trust your strategic initiatives if you can’t trust your foundational data. 

When it comes to improving data quality, it’s important to consolidate disparate or duplicate data from multiple sources and organize it to ensure consistency across your entire business. But because data combined from multiple sources is often incomplete, inconsistent and not suitable for analysis, simply migrating all of it into a single data lake or system cannot solve the underlying data quality problems preventing deeper analysis. 

What is Fuzzy Matching? 

Enter fuzzy matching. Also known as approximate string matching, fuzzy name matching or fuzzy string matching, fuzzy matching is a technique, often paired with AI and machine learning, that identifies and matches elements in datasets that are similar or partially matching but not identical. 

It is particularly useful when correcting datasets that contain typographical errors and spelling or formatting inconsistencies, or when comparing data from multiple sources. Fuzzy matching algorithms calculate a similarity score for each pair of records and compare that score against a threshold to decide whether the records should be treated as a match. 

This is a foundational first step in building a matching strategy and a core building block of the golden record management found in commercial master data management (MDM) platforms. 

These matching algorithms consider several factors when matching separate records including: 

  • Character similarity: The degree of resemblance between individual characters within the text. 
  • String distance metrics: Mathematical algorithms that quantify the difference or similarity between two strings of text. A common choice is the Levenshtein distance, which measures the minimum number of single-character edits (insertions, deletions or substitutions) required to transform one string into another. 
  • Phonetic similarity: The degree of similarity between two words or strings based on how they sound when spoken aloud. 
  • Token-level matching: The process of comparing individual units (tokens) of data within a defined dataset. Tokens often represent words, phrases or sequences of characters. 
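To make two of these factors concrete, here is a minimal Python sketch of a Levenshtein distance, a normalized similarity score derived from it, and a simple token-level comparison. The function names and the Jaccard-style token overlap are illustrative assumptions, not the method used by any particular MDM product.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the edit distance to a score between 0 and 1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def token_jaccard(a: str, b: str) -> float:
    """Token-level overlap: shared words divided by all distinct words."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein("Jane Depp", "Jayne Dep"))
print(round(similarity("Jane Depp", "Jayne Dep"), 2))
print(token_jaccard("Acme Corp Inc", "inc acme corp"))
```

Character-level scores tolerate typos inside a word, while token-level scores tolerate reordered words, so matching strategies often combine both.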

A common example of fuzzy matching is matching or grouping customer records. Manual data entry and differing formatting standards inevitably introduce errors and variations.  

Although there might be two separate customer records with one filed under “Jane Depp” and the other under “Jayne Dep,” fuzzy matching can look at the similarities between other pieces of the records to determine if they are for the same customer. Fuzzy algorithms can determine they are likely the same person even if they aren’t an exact match in the source system. 
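As a rough illustration of comparing several pieces of a record at once, the sketch below scores two customer records field by field with Python's standard `difflib.SequenceMatcher` and combines the results into a weighted score. The field names, weights, sample records and the 0.85 threshold are all hypothetical and would need tuning against real data.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; they must sum to 1.0 for the score to stay in 0..1.
WEIGHTS = {"name": 0.4, "email": 0.4, "postal_code": 0.2}

# Assumed cut-off above which two records are treated as the same customer.
MATCH_THRESHOLD = 0.85


def field_similarity(a: str, b: str) -> float:
    """Similarity ratio between 0 and 1, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of the per-field similarities."""
    return sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in WEIGHTS.items()
    )


# Made-up sample records for illustration only.
record_a = {"name": "Jane Depp", "email": "jane.depp@example.com", "postal_code": "30301"}
record_b = {"name": "Jayne Dep", "email": "jane.depp@example.com", "postal_code": "30301"}
record_c = {"name": "John Smith", "email": "jsmith@example.org", "postal_code": "98101"}

for other in (record_b, record_c):
    score = record_score(record_a, other)
    verdict = "likely match" if score >= MATCH_THRESHOLD else "no match"
    print(f"{record_a['name']} vs {other['name']}: {score:.2f} ({verdict})")
```

Weighting a stable field such as email more heavily than a free-text name is one possible design choice; other strategies rely on rules or trained models instead of fixed weights.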

When is Fuzzy Matching Better Than Exact Matching?  

While exact matching is useful for comparing and matching identical data precisely, it falls short when enterprise data is messy or inconsistent. It requires values to be identical and does not allow for any variation. 

Exact matching is used when the data is expected to be consistent and uniform for elements like numbers, text strings or categorical variables. The process comes in handy for things like joining datasets, data validation, filtering, querying and aggregating data. Fuzzy matching is for those data elements that are similar but don’t quite line up with one another. 

The fuzzy matching process is especially useful for the elements that make up a customer 360 view, including customer names, addresses or product descriptions that are similar or duplicated. It tolerates slight differences in spelling, formatting or abbreviations. This approach is also a viable option when searching or retrieving information in cases where exact matches are unavailable or unnecessary.  
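The difference between the two approaches can be sketched in a few lines of Python. In this example an exact comparison rejects two spellings of the same address, while a fuzzy comparison accepts them after expanding abbreviations. The abbreviation map, the sample addresses and the 0.9 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation map; a real one would be far larger.
ABBREVIATIONS = {
    "st": "street", "st.": "street",
    "ave": "avenue", "ave.": "avenue",
    "rd": "road", "rd.": "road",
}


def normalize(text: str) -> str:
    """Lowercase, drop commas and expand known abbreviations."""
    tokens = text.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def exact_match(a: str, b: str) -> bool:
    """Values match only when they are character-for-character identical."""
    return a == b


def fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Values match when their normalized forms are similar enough."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


crm_address = "123 Main St."
billing_address = "123 main street"
typo_address = "123 Main Stret"

print(exact_match(crm_address, billing_address))  # exact comparison
print(fuzzy_match(crm_address, billing_address))  # fuzzy comparison
print(fuzzy_match(crm_address, typo_address))     # fuzzy comparison with a typo
```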

It can also surface data connections or relationships that would otherwise be lost with exact matching. If John and Jane are married and live at the same address, fuzzy matching can recognize the shared location even when it is written differently in each record, and infer that they are two separate customers who live together and, depending on the business, may make joint purchases. For some organizations, this changes which marketing or sales opportunities apply to those customers. 
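One way to picture this kind of relationship discovery is to group customers whose addresses are approximately the same. The following sketch uses a simple greedy grouping; the customer records, the cleaning rules and the 0.85 threshold are hypothetical, and production systems typically use more robust address parsing.

```python
from difflib import SequenceMatcher


def clean_address(address: str) -> str:
    """Lowercase and strip punctuation so formatting differences matter less."""
    return " ".join(address.lower().replace(".", "").replace(",", "").split())


def same_address(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when two addresses are similar enough to be treated as one location."""
    return SequenceMatcher(None, clean_address(a), clean_address(b)).ratio() >= threshold


def group_households(customers: list, threshold: float = 0.85) -> list:
    """Greedy grouping: each customer joins the first household with a similar address."""
    households = []
    for customer in customers:
        for household in households:
            if same_address(customer["address"], household[0]["address"], threshold):
                household.append(customer)
                break
        else:
            # No similar address found, so start a new household.
            households.append([customer])
    return households


# Made-up sample data for illustration only.
customers = [
    {"name": "John Doe", "address": "42 Elm Street, Springfield"},
    {"name": "Jane Doe", "address": "42 Elm St., Springfield"},
    {"name": "Sam Lee", "address": "7 Harbor Road, Portland"},
]

for household in group_households(customers):
    print([member["name"] for member in household])
```

Note that the customers stay separate records; only the household relationship between them is inferred.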

How Can Fuzzy Matching Be Used to Fix Data Quality Issues? 

The fuzzy matching technique is a huge step in the data cleansing process and is crucial for any digital transformation or strategic initiative. When finding approximate matches between similar information, it identifies and resolves inconsistencies, inaccuracies and discrepancies. Without tackling these issues, your enterprise’s data quality is bound to suffer and harm your business. 

[Figure: circular graphic displaying data quality rules]

When it comes to specific data quality issues, fuzzy matching targets several problems directly while improving overall quality: 

  1. Duplicate data: Identifies potential or existing duplicates within a defined dataset by comparing multiple attributes and fields. 
  2. Standardization: Standardizes data values by matching to previously established reference data including values and patterns. 
  3. Enrichment: Links and merges similar data from multiple sources, including third-party data validation and enrichment services, to create more accurate datasets while eliminating gaps in information.
  4. Correction: Identifies errors in data values and automatically applies corrections to remove inconsistencies. 
  5. Record Linkage: Links records across several datasets and systems that often lack a unique identifier by comparing key attributes to find similar records, so multi-source data can be consolidated and integrated. 
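Two of the items above, standardization and duplicate detection, can be sketched with Python's standard `difflib` module. The reference list, sample values and cut-offs below are assumptions chosen for illustration; the pairwise duplicate check is also only practical for small lists, since it compares every pair.

```python
from difflib import SequenceMatcher, get_close_matches
from itertools import combinations

# Hypothetical reference data that incoming values are standardized against.
REFERENCE_COUNTRIES = ["United States", "United Kingdom", "Germany", "France"]


def standardize(value: str, reference: list, cutoff: float = 0.8):
    """Return the closest reference value, or None if nothing is similar enough."""
    matches = get_close_matches(value, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def find_duplicates(names: list, threshold: float = 0.85) -> list:
    """Return every pair of names whose similarity meets the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


print(standardize("Untied States", REFERENCE_COUNTRIES))
print(standardize("Germny", REFERENCE_COUNTRIES))
print(standardize("Atlantis", REFERENCE_COUNTRIES))
print(find_duplicates(["Jane Depp", "Jayne Dep", "John Smith"]))
```

Returning `None` for values with no close match, rather than forcing the nearest reference value, leaves room for manual review of uncertain cases.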

Fuzzy matching addresses data quality issues to help improve overall data accuracy and consistency, which often leads to better decision-making and more reliable analysis. 

Matching is Only the First Step 

Fuzzy matching is a foundational element of data quality management and a great first step in understanding the underlying issues in your enterprise data.  

Read the full guide to data quality to learn more about incorporating data quality management into an effective, data-driven business strategy. 
