Fuzzy matching – Comparing records with string distance measures

May 24, 2018

In previous posts we have loosely discussed the act of comparing strings. Until now this has been a nebulous term relying primarily on intuition to understand but if we want automated systems such as PolyAnalyst to perform what we call comparison it needs to be well defined and based on calculation. This concept is typically referred to as a string metric and is a method of putting into numbers how similar two strings are.

Despite the name, many of these measurements are not in fact metrics by definition because they do not all obey the triangle inequality; however, they remain very useful in the measurement of strings. There are many string metrics of varying degree of complexity and usefulness.

A traditional exemplar of string metrics is the Levenshtein distance.

What is the Levenshtein Distance?

Informally, the Levenshtein distance is the number of edits it takes to turn one string into the other while defining an edit as inserting characters, deleting characters, or substituting characters.

For example, let us compare the words “Kitten” with “Sitting.”

Kitten	Sitting
Sitten	Sitting
Sittin	Sitting
Sitting	Sitting

It took three operations to turn “Kitten” into “Sitting” so we can say the Levenshtein distance between the words is 3.

Other metrics

Other distance metrics include Damerau-Levenshtein that also takes into account transpositions of characters and Jaro-Winkler which considers matching characters and transpositions between strings but adds more complexity in both the definition and calculation.

Basic string metrics do not account for any semantic information about strings, however. They deal solely with characters and measure simply how close in terms of characters are strings alike. This is why, without normalization, string metrics are virtually useless. However, if we normalize our data into small atomic attributes, then we have already isolated the semantic components and so then relying on simple character distances actually provides some useful information to us.

String metrics are not the complete solution but they are an important piece in a system which can achieve fuzzy matching – piece that allows us to quantify how similar entity components are which allows us to automate this process and remove some degree of arbitrariness in an already subjective task.

February 6, 2024

Query languages—the Swiss army knife of information extraction

January 22, 2020

How to Audit and Explore Datasets in PolyAnalyst

December 16, 2019

Fuzzy matching – Comparing records with string distance measures

Fuzzy matching – Comparing records with string distance measures

What is the Levenshtein Distance?

Other metrics

Related Articles

Query languages—the Swiss army knife of information extraction

How to Audit and Explore Datasets in PolyAnalyst

Working with the Twitter API

Contact Us

Software

Solutions

Services