In our analysis of textual data, there are many types of entities we need to identify and capture, such as companies, organizations, places, currencies, stocks, and people.
The difficulties in accurately capturing entities in text
This task alone can be challenging, but in many scenarios it is further complicated by the fact that entities are highly contextual and can be referenced in many forms. References to people demonstrate this well. For example, an individual named Steve Michael Johnson may be referred to as Steve, Steve Johnson, Mr. Johnson, and in other ways. This is only the beginning of the complexity. All of these references are versions of the full name, so they are not too difficult to link together. Depending on the context, however, an individual may be referenced without any fragment of their name.
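To make the "versions of the full name" case concrete, here is a minimal Python sketch that generates a few surface forms of a full name and checks whether a mention is one of them. The function names and the list of honorifics are assumptions made for illustration. This is not how PolyAnalyst matches names, and real name matching needs far more care (nicknames, initials, misspellings).

```python
HONORIFICS = ("Mr.", "Ms.", "Mrs.", "Dr.")  # illustrative list, not exhaustive


def name_variants(full_name: str) -> set[str]:
    """Return a few simple surface forms of a full name (hypothetical helper)."""
    parts = full_name.split()
    if not parts:
        return set()
    first, last = parts[0], parts[-1]
    variants = {full_name, first, f"{first} {last}"}
    # Honorific + surname forms such as "Mr. Johnson"
    variants.update(f"{title} {last}" for title in HONORIFICS)
    return variants


def is_variant(mention: str, full_name: str) -> bool:
    """True if the mention is one of the generated forms of the full name."""
    return mention in name_variants(full_name)


if __name__ == "__main__":
    for mention in ("Steve", "Mr. Johnson", "the claimant"):
        print(mention, is_variant(mention, "Steve Michael Johnson"))
```

A reference such as "the claimant" shares no fragment with the name, so a lookup of this kind cannot link it to the person. That is the harder case.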
Challenges present in auto accident reports
Consider auto accident reports. In this data, an individual may be referenced by a form of their name, or they may simply be called “the insured party” or “the claimant”, or any shorthand variation thereof. A party may also be referenced through their vehicle by metonymy, as in “the red Toyota was speeding.” Auto accidents usually involve multiple drivers, so the text will contain references to “Driver 1” or “the second driver”, which represent specific parties without naming them. Finally, people can be referenced anaphorically with pronouns, as in “He did not see the red light.”
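The reference types listed above can be illustrated with a small Python sketch that tags role terms, driver designations, vehicle metonyms, and pronouns using regular expressions. The pattern set, the color list, and the `find_mentions` helper are all assumptions made for this example. They only cover the phrasings quoted in this section and are not a general-purpose detector.

```python
import re

# Hypothetical patterns covering only the example phrasings in this section.
MENTION_PATTERNS = {
    "role": re.compile(r"\bthe (?:insured(?: party)?|claimant)\b", re.IGNORECASE),
    "driver": re.compile(
        r"\b(?:driver (?:\d+|one|two)|the (?:first|second) driver)\b", re.IGNORECASE
    ),
    # A color followed by a capitalized make or model, e.g. "the red Toyota".
    "vehicle": re.compile(r"\b[Tt]he (?:red|white|blue|black) [A-Z][\w-]*"),
    "pronoun": re.compile(r"\b(?:he|she|they)\b", re.IGNORECASE),
}


def find_mentions(text: str) -> list[tuple[int, str, str]]:
    """Return (start offset, mention kind, matched text), ordered by position."""
    found = []
    for kind, pattern in MENTION_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), kind, match.group(0)))
    return sorted(found)


if __name__ == "__main__":
    report = "The red Toyota was speeding. He did not see the red light. Driver 1 braked."
    for start, kind, surface in find_mentions(report):
        print(start, kind, surface)
```

Note that the sketch only finds the mentions. It does not say which party "He" or "Driver 1" refers to.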
Linking entities together
Identifying names of individuals is one thing, but we often need to link those entities with other entities we discover in the text to create a more holistic representation of the knowledge the text contains. This task is called Entity Resolution. A major application of Entity Resolution is the auto accident report example examined above. If an insurance company wishes to analyze these reports for subrogation, litigation, or liability assessment, it is essential to accurately identify two parties, the insured and the claimant, and to extract the actions these parties took and the conditions they were in. As we saw, these parties can be referenced differently from record to record. We cannot simply build a list of names to search for ahead of time, because names can be shared: Mr. Smith may be the insured in one record, while a different Mr. Smith may be the claimant in another. The same is true for other types of references, such as vehicles and drivers. The red Toyota could be driven by anyone, and Driver One is not always the insured or always the claimant. Instead of relying on assumed knowledge, we have to search each record for information that builds a profile of the two parties, including their names, what they drive, and which driver designation refers to them.
Solving the problem with PolyAnalyst’s Extensible Pattern Definition Language
We achieve this by using PolyAnalyst’s Entity Extraction capabilities with the Extensible Pattern Definition Language (XPDL), in three steps. The first step is to look for these references independently. This means searching for all names and vehicles regardless of what we know about their relationships to the parties. After we have identified these entities we can, in the second step, create links between them and the parties they belong to. For example, the text may contain “The insured, Steve Johnson, was rear-ended” and other phrases that link entities together semantically. From this phrase we know that the insured party has the name Steve Johnson. After we search the text for semantic links between the entities, we carry out the third step, called post-processing. Here we propagate the knowledge obtained in step two across entities. For example, whenever we see “Steve Johnson” or variations thereof in the text, we can mark it as the insured party. This also allows us to make deductive jumps. If we spot “The insured, Steve Johnson, was rear-ended” and “Steve was in a white F-150”, step two alone does not tell us that the insured was in a white Ford F-150, but by propagating this information across the data we can deduce such new links.
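The three steps can be sketched in Python as follows. This is an illustration of the logic only, not XPDL syntax and not PolyAnalyst's implementation. The tiny first-name list, the color list, the two linking patterns, and the rule that two name mentions refer to the same person when their first tokens match are all simplifying assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for real name and vehicle extractors.
NAME = r"\b(?:Steve|Maria)(?: [A-Z][a-z]+)?"
VEHICLE = r"(?:white|red|blue) [A-Z][\w-]*"


def resolve(text: str) -> dict:
    # Step 1: find isolated entities, ignoring their relationships.
    names = set(re.findall(NAME, text))

    # Step 2: find explicit semantic links between entities.
    # "The insured, Steve Johnson," links a role to a name.
    role_of = {
        name: role
        for role, name in re.findall(rf"[Tt]he (insured|claimant), ({NAME}),", text)
    }
    # "Steve was in a white F-150" links a name to a vehicle.
    vehicle_of = dict(re.findall(rf"({NAME}) was in (?:a|the) ({VEHICLE})", text))

    # Step 3: post-processing. Propagate each role to every variant of the
    # linked name, then carry over whatever is linked to those variants.
    profiles = defaultdict(lambda: {"names": set(), "vehicles": set()})
    for name in names:
        role = next(
            (r for full, r in role_of.items() if full.split()[0] == name.split()[0]),
            None,
        )
        if role is None:
            continue  # no party could be determined for this mention
        profiles[role]["names"].add(name)
        if name in vehicle_of:
            profiles[role]["vehicles"].add(vehicle_of[name])
    return dict(profiles)


if __name__ == "__main__":
    record = "The insured, Steve Johnson, was rear-ended. Steve was in a white F-150."
    print(resolve(record))
```

On the example record, step two yields two separate facts (the insured is "Steve Johnson", and "Steve" was in a white F-150). Only the propagation in step three attaches the vehicle to the insured party's profile.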
Conclusion
This three-step process of finding isolated entities, establishing logical links between them, and aggregating that information in post-processing allows us to achieve Entity Resolution within the text in PolyAnalyst. This technique greatly expands how much information we can reliably extract from the text with XPDL. It can be applied to any scenario where complex entities such as people can be referenced in multiple ways.