In our daily work we often need to combine two or more datasets into one. This operation, known as a join, is simple when each record carries a unique ID that is present in both datasets. However, in many scenarios the datasets generate their keys in different ways, so the keys do not match, or the datasets have no unique keys at all. In these situations the traditional join does not suffice. For example, many of our projects involve the analysis of individual people. One dataset may come from a hospital and contain a person’s medical data, while another may come from an insurance company and contain policy information. It is unlikely that the two institutions share a record-keeping system in which each real-world individual receives the same unique key in both datasets, so our default method of joining data will not work. Analyzing such data requires a more advanced method of joining records.
This method is called Fuzzy Join. It allows us to specify the defining attributes of a record, which are then used to decide how to join it with records in other datasets. For example, with the medical and insurance datasets from earlier, the defining attributes would be those that identify an individual person, which may include their name, phone number, address, and date of birth. Note that these attributes are not assumed to be unique, as many people share the same name or birthdate. Additionally, these attributes rarely come in a perfectly normalized form: the same name can be spelled in various ways, and some components of an address may be absent in a given record-keeping system.
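To make the normalization problem concrete, here is a minimal Python sketch of how name and phone attributes might be brought into a comparable form before any matching happens. The function names `normalize_name` and `normalize_phone` are hypothetical and chosen for illustration only; this is not how PolyAnalyst implements Fuzzy Join, just one simple way to reduce spelling and formatting variation.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Return a lowercase, accent-free, punctuation-free version of a name.

    Hypothetical helper for illustration; real systems often also handle
    nicknames, titles, and name order.
    """
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Keep only letters and whitespace, then collapse repeated whitespace.
    letters_only = re.sub(r"[^a-z\s]", "", without_accents.lower())
    return " ".join(letters_only.split())


def normalize_phone(phone: str) -> str:
    """Keep digits only, so formatting differences disappear."""
    return re.sub(r"\D", "", phone)
```

With these helpers, "José  Smith, Jr." and "jose smith jr" normalize to the same string, and "(555) 010-2000" and "555.010.2000" normalize to the same digit sequence. Normalization alone does not resolve genuine spelling differences such as "Jon" versus "John"; those still require an approximate comparison.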
Despite this, Fuzzy Join can link records corresponding to the same entity based on this “fuzzy” information. It abstracts and normalizes the data and then groups records into equivalence classes instead of requiring strict equality of keys. A person’s name may be spelled differently in two datasets, and Fuzzy Join can still identify the two records as the same person if enough supplementary information is available.
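The idea of grouping records into equivalence classes can be sketched in a few lines of Python. The example below is an illustrative assumption, not the algorithm used by PolyAnalyst: it treats two records as the same entity when their names are similar (using the standard library's `difflib.SequenceMatcher`) and at least one supplementary attribute agrees, then merges matching records with a small union-find structure. The field names (`name`, `dob`, `phone`), the 0.8 threshold, and the functions `same_entity` and `link_records` are all hypothetical.

```python
from difflib import SequenceMatcher


def agrees(x, y):
    """True when both values are present and identical."""
    return bool(x) and x == y


def same_entity(a, b, name_threshold=0.8):
    """Assumed matching rule: similar names plus one agreeing attribute."""
    name_score = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()
    ).ratio()
    supporting = agrees(a["dob"], b["dob"]) + agrees(a["phone"], b["phone"])
    return name_score >= name_threshold and supporting >= 1


def link_records(records):
    """Group record indices into equivalence classes using union-find."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; fine for a sketch, too slow for large datasets.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if same_entity(records[i], records[j]):
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up records: one from a hospital, two from an insurer.
records = [
    {"name": "Jon Smith", "dob": "1980-02-01", "phone": "5550100"},
    {"name": "John Smith", "dob": "1980-02-01", "phone": "5550199"},
    {"name": "John Smith", "dob": "1975-07-30", "phone": "5550142"},
]
groups = link_records(records)
```

In this made-up data, the first two records end up in one group because the names are close and the birthdates agree, while the third record stays separate even though its name is identical to the second, since no supplementary attribute supports the match. Merging through union-find makes the grouping transitive, which is one design choice among several; a production system would also need to avoid comparing every pair of records.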
Fuzzy Join has allowed us to tackle a wide range of new problems, often involving datasets from multiple sources with no unique key attribute. It has applications in the medical, insurance, banking, and government sectors, where we frequently deal with fuzzy entities such as people and places.
Fuzzy Join is a feature of PolyAnalyst, Megaputer’s data mining software.