When we compare entities stored as strings, we often need to look at the individual pieces of each string rather than the string as a whole. For example, suppose we examine the following names in our data:
Steven M Johnson |
Steve Michael Johnson |
When we, as humans, compare these entities, we don’t directly compare each string as a whole. Instead we compare substrings. “Steven” is compared to “Steve,” “M” to “Michael,” and “Johnson” to “Johnson.” We do this because we recognize that a name isn’t one indivisible thing but is composed of smaller attributes – in this case the first, middle, and last names. It makes more sense to compare these substrings pairwise than to compare the entire strings. “Steven M Johnson” as a whole may not be very similar (however we choose to define similarity) to “Steve Michael Johnson,” but the individual components are not that far off.
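To make the difference concrete, here is a minimal sketch in Python that scores the two names as whole strings and then component by component. It uses `SequenceMatcher` from the standard library as one possible similarity measure; that choice, the `similarity` helper, and the assumption that both names split into the same number of space-separated parts are ours for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


name_a = "Steven M Johnson"
name_b = "Steve Michael Johnson"

# Compare the two names as single, undivided strings.
whole_score = similarity(name_a, name_b)

# Compare first to first, middle to middle, last to last.
# Assumption: both names have the same number of space-separated parts.
component_scores = [
    similarity(part_a, part_b)
    for part_a, part_b in zip(name_a.split(), name_b.split())
]

print("whole string:", round(whole_score, 3))
print("components:  ", [round(score, 3) for score in component_scores])
```

With this particular measure the last names match exactly and the first names score high, while the initial “M” scores low against “Michael.” How to treat an initial versus a full name is a decision we can only make once the components are separated.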
This process of deconstructing entities into more atomic components is called normalization, and it is an essential step in string comparison and fuzzy matching. It allows us to make better-informed assessments when comparing strings because we only compare substrings that are directly related to one another.
Automating the process of normalization
We humans naturally understand the semantic components of entities, but creating an automated process to do the same is a different matter. Luckily, PolyAnalyst has a solution. By labeling data columns with semantic categories such as names, addresses, or dates, we can instruct PolyAnalyst to normalize the corresponding data with prewritten rules. When we wish to normalize data, all we need to do is tell PolyAnalyst what kind of data our columns contain and it will do the rest.
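The idea of attaching a semantic label to a column and letting that label select a rule can be sketched in a few lines of Python. This is not PolyAnalyst's interface or its rule set: the function names, the `RULES` table, the column names, and the very simple splitting logic are all hypothetical and only illustrate the principle.

```python
from typing import Callable, Dict


def normalize_name(value: str) -> Dict[str, str]:
    # Assumption: the name is exactly "first middle last".
    first, middle, last = value.split()
    return {"First Name": first, "Middle Name": middle, "Last Name": last}


def normalize_date(value: str) -> Dict[str, str]:
    # Assumption: the date is written as month/day/year.
    month, day, year = value.split("/")
    return {"Month": month, "Day": day, "Year": year}


# Hypothetical mapping from a semantic label to the rule that handles it.
RULES: Dict[str, Callable[[str], Dict[str, str]]] = {
    "name": normalize_name,
    "date": normalize_date,
}


def normalize_row(row: Dict[str, str], labels: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Apply the rule selected by each column's label to that column's value."""
    return {column: RULES[labels[column]](value) for column, value in row.items()}


# Hypothetical column names; the analyst only supplies the labels.
row = {"Customer": "Steve Michael Johnson", "Birth Date": "2/12/1982"}
labels = {"Customer": "name", "Birth Date": "date"}
normalized = normalize_row(row, labels)
print(normalized)
```

The point of the design is that the analyst's work is limited to the `labels` mapping; the rules themselves are written once and reused for every column of that type.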
An example of normalization
For example, suppose we have the following data:
Steve Michael Johnson | 987 S Woodwillow Dr Apt 9 | 2/12/1982 |
These three columns represent different types of semantic information – names, addresses, and dates respectively. Once we label these columns appropriately and run normalization we would end up with something like this:
Name
First Name | Middle Name | Last Name |
Steve | Michael | Johnson |
Address
Number | Modifier | Street | Type | Modifier 2 |
987 | South | Woodwillow | Drive | Apartment 9 |
Date
Month | Day | Year |
2 | 12 | 1982 |
Notice that “S,” “Dr,” and “Apt” have all been expanded. This is another function of normalization. Many different strings can represent the same information: “S” and “South” are semantically equivalent, as are “Dr” and “Drive” and “Apt” and “Apartment.” We should therefore choose one form (the choice is arbitrary) and normalize every variant to it, so that future comparisons are not thrown off by differences in spelling alone.
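A minimal Python sketch of this kind of expansion for the address above might look as follows. The lookup tables, the regular expression, and the `normalize_address` function are illustrative assumptions, not PolyAnalyst's actual rules, and the pattern only handles addresses shaped like the example (number, direction, street, type, unit, unit number).

```python
import re

# Small illustrative lookup tables; a real rule set would be far larger.
DIRECTIONS = {"N": "North", "S": "South", "E": "East", "W": "West"}
STREET_TYPES = {"Dr": "Drive", "St": "Street", "Ave": "Avenue"}
UNIT_TYPES = {"Apt": "Apartment", "Ste": "Suite"}

# Assumption: number, one-letter direction, one-word street, type, unit, unit number.
ADDRESS_PATTERN = re.compile(
    r"^(?P<number>\d+)\s+(?P<direction>[NSEW])\s+(?P<street>\w+)\s+"
    r"(?P<type>\w+)\s+(?P<unit>\w+)\s+(?P<unit_number>\w+)$"
)


def normalize_address(address: str) -> dict:
    """Split an address into components and expand known abbreviations."""
    match = ADDRESS_PATTERN.match(address.strip())
    if match is None:
        raise ValueError(f"Unrecognized address format: {address!r}")
    parts = match.groupdict()
    # Unknown abbreviations are left unchanged rather than guessed at.
    unit = UNIT_TYPES.get(parts["unit"], parts["unit"])
    return {
        "Number": parts["number"],
        "Modifier": DIRECTIONS.get(parts["direction"], parts["direction"]),
        "Street": parts["street"],
        "Type": STREET_TYPES.get(parts["type"], parts["type"]),
        "Modifier 2": f"{unit} {parts['unit_number']}",
    }


result = normalize_address("987 S Woodwillow Dr Apt 9")
print(result)
```

Because every variant maps to the same canonical form, “987 S Woodwillow Dr” and “987 South Woodwillow Drive” would end up with identical components before any fuzzy comparison takes place.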
Conclusion
Ultimately, normalization requires little effort from the analyst, yet it is a critical step in fuzzy matching. Breaking our data down into atomic pieces that can be compared directly is crucial to obtaining accurate results.