Fuzzy Matching – Definition, Process and Techniques - Unite.AI

Fuzzy Matching – Definition, Process and Techniques

An Accenture survey showed that 75% of consumers prefer buying from retailers who know their name and purchasing behavior, and 52% of them are more likely to switch brands if a brand doesn’t offer personalized experiences. With millions of data points being captured by brands almost every day, identifying unique customers and building their profiles is one of the biggest challenges faced by most companies.

When an enterprise uses multiple tools for capturing data, it is very common for a customer’s name to be misspelled or for an email address with an invalid pattern to be accepted. Moreover, when disparate data applications hold varying information about the same customer, it becomes nearly impossible to gain insights into customer behavior and preferences.

In this article, we will learn what fuzzy matching is, how it is implemented, the common techniques used, and the challenges faced. Let’s get started.

What is fuzzy matching?

Fuzzy matching is a data matching technique that compares two or more records and calculates the likelihood of them belonging to the same entity. Rather than strictly categorizing records as a match or a non-match, fuzzy matching outputs a score (usually between 0% and 100%) that indicates how likely it is that these records belong to the same customer, product, employee, etc.
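
To make the idea of a match score concrete, here is a minimal sketch that uses `SequenceMatcher` from Python's standard `difflib` module. The function name `match_score` and the sample names are assumptions made for illustration, and `SequenceMatcher` is only one of many possible similarity measures, not the method of any particular fuzzy matching product.

```python
from difflib import SequenceMatcher


def match_score(a: str, b: str) -> float:
    """Return a similarity score between 0 and 100 for two strings."""
    # Lowercase and trim so trivial formatting differences do not lower the score.
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    return 100 * SequenceMatcher(None, a_norm, b_norm).ratio()


# A likely misspelling scores much higher than an unrelated name.
print(round(match_score("Jon Smith", "John Smith"), 1))
print(round(match_score("Jon Smith", "Mary Jones"), 1))
```

The output is a graded score rather than a yes/no answer, which is what separates fuzzy matching from exact matching.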

An efficient fuzzy matching algorithm handles a range of data ambiguities, such as first/last name reversals, acronyms, shortened names, phonetic and deliberate misspellings, abbreviations, and added or removed punctuation.

Fuzzy matching process

The fuzzy matching process is carried out as follows:

  1. Profile records for basic standardization errors. These errors are fixed so that a uniform and standardized view is achieved across records.
  2. Select and map attributes based on which fuzzy matching will take place. Since these attributes may be titled differently, they must be mapped across sources.
  3. Choose a fuzzy matching technique for each attribute. For example, names can be matched based on keyboard distance or name variants, while phone numbers can be matched based on numeric similarity metrics.
  4. Select a weight for each attribute, such that attributes assigned higher weights (or higher priority) will have more impact on the overall match confidence level as compared to fields having lower weights.
  5. Define the threshold level: records with a fuzzy matching score higher than this level are considered a match, and those that fall short are considered a non-match.
  6. Run fuzzy matching algorithms and analyze the match results.
  7. Override any false positives and negatives that might come up.
  8. Merge, deduplicate, or simply eliminate the duplicate records.
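
The scoring part of the steps above (standardize, pick a technique, weight the attributes, apply a threshold) can be sketched as follows. This is a simplified illustration, not a production implementation: the field names, the weights, the threshold of 85, and the use of a single `difflib`-based technique for every attribute are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights and threshold; real values depend on your data.
WEIGHTS = {"name": 0.5, "email": 0.3, "phone": 0.2}
THRESHOLD = 85.0


def standardize(record: dict) -> dict:
    """Step 1: lowercase and trim text fields; keep only digits in phone numbers."""
    return {
        "name": record["name"].strip().lower(),
        "email": record["email"].strip().lower(),
        "phone": "".join(ch for ch in record["phone"] if ch.isdigit()),
    }


def similarity(a: str, b: str) -> float:
    """Step 3: one generic technique for all attributes (an assumption)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def match_confidence(rec_a: dict, rec_b: dict) -> float:
    """Steps 4 and 6: weighted sum of the per-attribute scores (0 to 100)."""
    a, b = standardize(rec_a), standardize(rec_b)
    return sum(weight * similarity(a[field], b[field])
               for field, weight in WEIGHTS.items())


def is_match(rec_a: dict, rec_b: dict) -> bool:
    """Step 5: a pair is a match when its score is higher than the threshold."""
    return match_confidence(rec_a, rec_b) > THRESHOLD


crm = {"name": "Jon Smith", "email": "jon.smith@example.com", "phone": "(555) 010-1234"}
billing = {"name": "John Smith ", "email": "Jon.Smith@example.com", "phone": "555-010-1234"}
other = {"name": "Mary Jones", "email": "mjones@example.org", "phone": "555-019-8765"}

print(is_match(crm, billing))  # same customer with a name typo and different formatting
print(is_match(crm, other))    # a different customer
```

Because the weights sum to 1, the overall confidence stays on the same 0 to 100 scale as the per-attribute scores, which keeps the threshold easy to interpret.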

Fuzzy matching parameters

From the process defined above, you can see that a fuzzy matching algorithm has a number of parameters that form the basis of this technique. These include the attribute weights, fuzzy matching technique, and the score threshold level.

To get optimal results, you must run fuzzy matching with varying parameters and find the values that suit your data best. Many vendors package such capabilities within their fuzzy matching solutions, where these parameters are auto-tuned but can be customized depending on your needs.
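
As a rough sketch of what trying varying parameters can look like, the example below tests several candidate thresholds against a small set of hand-labeled pairs and keeps the one that misclassifies the fewest. The labeled pairs, the candidate range, and the `difflib`-based score are assumptions for illustration; the same loop could be extended to attribute weights or to the choice of technique.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Similarity score between 0 and 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical hand-labeled pairs: (value_a, value_b, is_same_entity).
LABELED_PAIRS = [
    ("Jon Smith", "John Smith", True),
    ("Katherine Lee", "Kathrine Lee", True),
    ("Acme Corp", "ACME Corp.", True),
    ("Jon Smith", "Joan Smythe", False),
    ("Mary Jones", "Mark Johnson", False),
]


def errors_at(threshold: float) -> int:
    """Count the labeled pairs that this threshold classifies incorrectly."""
    return sum((score(a, b) > threshold) != same
               for a, b, same in LABELED_PAIRS)


def best_threshold(candidates) -> float:
    """Return the first candidate with the fewest misclassified pairs."""
    return min(candidates, key=errors_at)


chosen = best_threshold(range(50, 100, 5))
print(chosen, errors_at(chosen))
```

With so few labeled pairs the chosen value is only indicative; a realistic tuning run would use a much larger labeled sample drawn from your own data.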

What are fuzzy matching techniques?

There are many fuzzy matching techniques in use today, and they differ in the exact algorithm or formula used to compare and match fields. Depending on the nature of your data, you can choose the technique that suits your requirements. Here is a list of common fuzzy matching techniques:

  1. Character-based similarity metrics that are best to match strings. These include:
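    1. Edit distance: Counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.
    2. Affine gap distance: Extends edit distance by treating a run of missing characters as a gap, with separate penalties for opening a gap and for extending it, so shortened or truncated strings are penalized less.
    3. Smith-Waterman distance: Finds the best-matching substrings of two strings, so differing prefixes and suffixes lower the score less than differences in the middle.
    4. Jaro distance: Scores two strings by their matching characters and transpositions; best suited to short strings such as first and last names.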
  2. Token-based similarity metrics that are best to match complete words in strings. These include:
    1. Atomic strings: Divides long strings into words delimited by punctuation and compares the individual words.
    2. WHIRL: Similar to atomic strings but WHIRL also assigns weights to each word.
  3. Phonetic similarity metrics that are best to compare words that sound similar but have totally different character composition. These include:
    1. Soundex: Best to compare surnames that are different in spelling but sound similar.
    2. NYSIIS: Similar to Soundex, but it also retains details about vowel position.
    3. Metaphone: Compares similar-sounding words that exist in the English language, other words familiar to Americans, and first and family names commonly used in the US.
  4. Numeric similarity metrics that compare numbers, how far they are from each other, the distribution of numeric data, etc.
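
To illustrate how the three string-oriented families above differ, here is a minimal sketch with one technique from each: a Levenshtein edit distance (character-based), a word-overlap score in the spirit of atomic strings (token-based, using Jaccard overlap as an assumed scoring rule), and American Soundex (phonetic). The function names are hypothetical, and the implementations are simplified for readability rather than tuned for speed.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute if different
        prev = curr
    return prev[-1]


def token_overlap(a: str, b: str) -> float:
    """Share of distinct words the two strings have in common (Jaccard overlap)."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


SOUNDEX_CODES = {ch: digit
                 for digit, group in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                      "4": "l", "5": "mn", "6": "r"}.items()
                 for ch in group}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'R163'."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # 'h' and 'w' do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


print(edit_distance("kitten", "sitting"))          # 3
print(token_overlap("Smith, John", "John Smith"))  # 1.0
print(soundex("Robert"), soundex("Rupert"))        # R163 R163
```

Note how each technique tolerates a different kind of variation: edit distance absorbs typos, token overlap ignores word order and punctuation, and Soundex groups names that are spelled differently but sound alike.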

Challenges of fuzzy matching

The fuzzy matching process – despite the amazing benefits it offers – can be quite difficult to implement. Here are some common challenges faced by businesses:

1.     High rate of false positives and negatives

Many fuzzy matching solutions have a high rate of false positives and negatives. A false positive occurs when the algorithm links records that belong to different entities, and a false negative occurs when it fails to link records that belong to the same entity. Configurable match definitions and fuzzy parameters can help reduce such incorrect links as much as possible.
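
One way to quantify these errors is to compare the pairs the algorithm linked against a set of pairs known to be true matches. The sketch below assumes each pair is represented as a tuple of two record IDs; the IDs and the function name `evaluate` are hypothetical.

```python
def evaluate(predicted: set, actual: set) -> dict:
    """Compare predicted match pairs with known true match pairs."""
    true_pos = len(predicted & actual)   # correctly linked pairs
    false_pos = len(predicted - actual)  # linked, but not the same entity
    false_neg = len(actual - predicted)  # same entity, but not linked
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        # Share of predicted links that are correct.
        "precision": true_pos / len(predicted) if predicted else 0.0,
        # Share of true matches that were found.
        "recall": true_pos / len(actual) if actual else 0.0,
    }


# Hypothetical record-ID pairs.
predicted_pairs = {(1, 2), (3, 4), (5, 6)}
true_pairs = {(1, 2), (3, 4), (7, 8)}

print(evaluate(predicted_pairs, true_pairs))
```

Raising the match threshold usually trades false positives for false negatives, so it helps to track both numbers whenever parameters change.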

2.     Computational complexity

During the matching process, every record is compared with every other record in the same dataset, so a dataset of n records requires n(n-1)/2 comparisons. Matching across multiple datasets increases the number of comparisons further. Because comparisons grow quadratically with the number of records, you must use a system that is capable of handling resource-intensive computations.
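
The growth is easy to see with a few lines of arithmetic. This sketch counts the comparisons needed when every record is compared with every other record; the dataset sizes and function names are arbitrary examples.

```python
from itertools import combinations
from math import comb


def pair_count(n: int) -> int:
    """Comparisons needed within one dataset of n records: n * (n - 1) / 2."""
    return comb(n, 2)


def cross_pair_count(n: int, m: int) -> int:
    """Comparisons needed between two datasets of n and m records."""
    return n * m


# The formula agrees with explicitly enumerating the pairs of a tiny dataset.
assert len(list(combinations(range(5), 2))) == pair_count(5)

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {pair_count(n):>13,} comparisons")
# 1,000 records need 499,500 comparisons; 100,000 records need 4,999,950,000.
```

A 100-fold increase in records leads to roughly a 10,000-fold increase in comparisons, which is why capacity planning matters before running a match on a large database.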

3.     Validation testing

The matched records are merged together to represent a complete 360-degree view of entities. Any error introduced during this process can add risk to your business operations. This is why detailed validation testing must be conducted to ensure the tuned algorithm consistently produces results with a high accuracy rate.

Wrap up

Businesses often think of fuzzy matching solutions as complex, resource-intensive, and money-draining projects that run for too long. The truth is that investing in the right solution, one that produces fast and accurate results, is the key. Organizations need to consider a number of factors when choosing a fuzzy matching tool, such as the time and money they are willing to invest, the scalability they have in mind, and the nature of their datasets. This will help them select a solution that enables them to get the most out of their data.

I’m a Product Marketing Analyst at Data Ladder with a background in IT. I passionately write about real-world data hygiene issues faced by many organizations today. I like to communicate solutions, tips, and practices that can help businesses achieve inherent data quality in their business intelligence processes. I strive to create content that is targeted towards a wide array of audiences, ranging from technical personnel to end users, as well as marketing it across various digital platforms.