In previous posts we have loosely discussed the act of comparing strings. Until now this has been a nebulous term relying primarily on intuition to understand but if we want automated systems such as PolyAnalyst to perform what we call comparison it needs to be well defined and based on calculation. This concept is typically referred to as a string metric and is a method of putting into numbers how similar two strings are.
Despite the name, many of these measurements are not in fact metrics by definition because they do not all obey the triangle inequality; however, they remain very useful in the measurement of strings. There are many string metrics of varying degree of complexity and usefulness.
A traditional exemplar of string metrics is the Levenshtein distance.
What is the Levenshtein Distance?
Informally, the Levenshtein distance is the number of edits it takes to turn one string into the other while defining an edit as inserting characters, deleting characters, or substituting characters.
For example, let us compare the words “Kitten” with “Sitting.”
Kitten | Sitting |
Sitten | Sitting |
Sittin | Sitting |
Sitting | Sitting |
It took three operations to turn “Kitten” into “Sitting” so we can say the Levenshtein distance between the words is 3.
Other metrics
Other distance metrics include Damerau-Levenshtein that also takes into account transpositions of characters and Jaro-Winkler which considers matching characters and transpositions between strings but adds more complexity in both the definition and calculation.
Basic string metrics do not account for any semantic information about strings, however. They deal solely with characters and measure simply how close in terms of characters are strings alike. This is why, without normalization, string metrics are virtually useless. However, if we normalize our data into small atomic attributes, then we have already isolated the semantic components and so then relying on simple character distances actually provides some useful information to us.
String metrics are not the complete solution but they are an important piece in a system which can achieve fuzzy matching – piece that allows us to quantify how similar entity components are which allows us to automate this process and remove some degree of arbitrariness in an already subjective task.