When working with big data, it is often possible to automate the slow, mind-numbing information extraction tasks, for precisely the reason that they require little creative energy. One such automatable process is detecting, extracting, and resolving human names from text data. For example, an insurance company may want to automatically extract from call center notes the names of all parties involved in an accident. However, they process hundreds of thousands of records each week. While a person may look at a single text and easily recognize names, it can be time-consuming or nearly impossible to examine each record individually and compile them into a suitable output.
Using algorithms that people apply unthinkingly to detect names, we can teach software to do this time-consuming work for us.
Some approaches and challenges
Building name dictionaries
The simplest approach to detecting names in big data is to build dictionaries, or lists, of common first names and surnames. As humans, we store prototypes of common names based on their frequency. That is, the more frequent a name is, the more easily and accurately it is recognized, both by a human reader and by a name detection algorithm. However, the dictionary approach alone limits how many names an algorithm can identify. It does not account for the many creative first names being coined each day (e.g., Blue Ivy, Jennyfyr), or for the countless words and non-words that can serve as last names (e.g., Glide, Nowak). Additionally, human bias may lead to far less coverage of names that are common in other countries (e.g., Kostas, Eleni) if we do not deliberately include them in our name dictionary.
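As a rough illustration of the dictionary approach and its main limitation, here is a minimal Python sketch. The two name sets and the function `find_dictionary_names` are hypothetical and deliberately tiny; a real dictionary would hold many thousands of entries.

```python
# Hypothetical, deliberately tiny dictionaries (assumption: real lists are far larger).
FIRST_NAMES = {"john", "mary", "kostas", "eleni"}
SURNAMES = {"doe", "smith", "nowak"}
ALL_NAMES = FIRST_NAMES | SURNAMES


def find_dictionary_names(text):
    """Return every token of `text` that appears in either name dictionary."""
    hits = []
    for token in text.split():
        # Strip surrounding punctuation so "Doe," still matches "doe".
        word = token.strip(".,;:!?\"'()")
        if word.lower() in ALL_NAMES:
            hits.append(word)
    return hits


print(find_dictionary_names("John Doe called about the claim."))
print(find_dictionary_names("Jennyfyr Glide called about the claim."))
```

The first call finds both tokens of the name, while the second finds nothing, because neither Jennyfyr nor Glide is in the lists. That gap is exactly what a dictionary alone cannot close.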
Rule-based approaches
Often it is more efficient to define abstract rules for common name patterns than to create an endless list of possible names. In this approach, we look for context clues that identify an item as a name. Let’s consider the simplest pattern for a human name: John Doe. We can quickly interpret this as a name, and even know which portion is the first or last name, based on a few abstract rules. Human readers will notice that there are two items following one another and that both are capitalized. We also know that the first item, John, is a very common first name that appears in most first name dictionaries. When we see the item that follows, Doe, we know that it does not refer to a female deer because it is also capitalized.
Utilizing this heuristic, we can build a very simple name detection algorithm by searching for two to three capitalized words in a row. Accuracy can be improved by adding known information, such as a list of common first names or surnames. By defining just one simple pattern, we can automate the discovery of previously unknown last names, as in the following example.
Here we can recognize the unknown word, Brouwer, as a last name because it is capitalized and it follows a common first name.
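The heuristic described above can be sketched in a few lines of Python. This is only one possible reading of the rule, not a finished detector: the `KNOWN_FIRST_NAMES` sample, the function name `detect_names`, and the first name Anna in the usage example are all assumptions made for illustration.

```python
import re

# Hypothetical sample of a first name dictionary.
KNOWN_FIRST_NAMES = {"John", "Mary", "Anna"}

# Two or three capitalized words in a row (ASCII letters only in this sketch).
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+){1,2}\b")


def detect_names(text):
    """Return capitalized runs whose first word is a known first name."""
    names = []
    for match in CAPITALIZED_RUN.finditer(text):
        words = match.group().split()
        # Known information (the first name list) confirms the candidate;
        # the remaining words are accepted as surnames even if never seen before.
        if words[0] in KNOWN_FIRST_NAMES:
            names.append(match.group())
    return names


print(detect_names("The caller, Anna Brouwer, reported the accident."))
```

Here Brouwer is picked up without appearing in any list. The sketch also shows why the rule needs refinement: in "Yesterday Anna Brouwer called." the capitalized sentence opener is absorbed into the run, the run no longer starts with a known first name, and the name is missed. The ASCII-only pattern would likewise miss accented names.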
Of course, this simple rule will need to be improved to account for other patterns and to filter out erroneous results. We should limit our results by banning some words with a stop list: a list of frequent words that are not typically names (e.g., verbs like is and was). Additionally, we must account for less simple name patterns and other challenges in detecting human names, which require more rules, more advanced approaches to automation, or both.
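A stop list can be applied as a simple filter over candidate names. The sketch below assumes the candidates have already been produced by some earlier detection step; the contents of `STOP_WORDS` and the function name `filter_candidates` are hypothetical.

```python
# Hypothetical stop list: frequent words that are rarely part of a person's name.
STOP_WORDS = {"is", "was", "the", "a", "an", "on", "in", "at"}


def filter_candidates(candidates):
    """Drop any candidate name that contains a stop-listed word."""
    kept = []
    for candidate in candidates:
        words = candidate.split()
        # Compare in lowercase so "The" and "the" are treated alike.
        if any(word.lower() in STOP_WORDS for word in words):
            continue
        kept.append(candidate)
    return kept


print(filter_candidates(["John Doe", "The Policy", "Was Driving"]))
```

Dropping the whole candidate is the bluntest option. A gentler variant could remove only the stop-listed word and keep the rest, at the cost of more false positives.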
One such challenge is defining contexts for first and last names that are also ordinary English words. Say, for example, we had “Hope” in our first name dictionary. A dictionary-based detector would capture this word as a first name in both of the following two records:
However, as human readers we know that Hope, as a person's name, is never a verb, as it is in the first sentence. Similarly, hope, in the sense of expectation or desire, does not typically drive a blue Chevy. So we can define the expected context of the name Hope as a noun that is capable of actions typical of people (e.g., driving cars, thinking). Conversely, we can rule out any occurrence in which the word is used as a verb.
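One very rough way to encode this kind of context rule is to look at the neighboring words. The sketch below is an assumption-laden stand-in for real part-of-speech analysis: the cue lists `VERB_CUES` and `PERSON_VERBS`, the function `is_name_in_context`, and the sample sentences are all hypothetical.

```python
# Hypothetical cue lists standing in for a real part-of-speech tagger.
AMBIGUOUS_NAMES = {"Hope"}
VERB_CUES = {"i", "we", "you", "they", "to"}              # a preceding cue suggests a verb
PERSON_VERBS = {"drives", "drove", "thinks", "said", "called"}  # a following cue suggests a person


def is_name_in_context(tokens, i):
    """Guess whether tokens[i] is an ambiguous word used as a person's name."""
    if tokens[i].capitalize() not in AMBIGUOUS_NAMES:
        return False
    previous_word = tokens[i - 1].lower() if i > 0 else ""
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    # "I hope ..." or "to hope ...": the word is acting as a verb.
    if previous_word in VERB_CUES:
        return False
    # Accept only when the word performs an action typical of people.
    return next_word in PERSON_VERBS


print(is_name_in_context("Hope drives a blue Chevy".split(), 0))
print(is_name_in_context("I hope the claim is approved".split(), 1))
```

Under these assumptions the first call accepts Hope as a name and the second rejects it. A production system would more likely rely on a trained tagger than on hand-written cue lists.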
Adjusting name patterns according to culture
There are many cultural differences in naming conventions that can affect the interpretation of a person’s name. It cannot be assumed that names from all cultures will follow the same pattern. While names across cultures can be composed of the same few simple components, those components can have vastly different interpretations according to the culture of origin.
Consider the contrast in the example below. A common pattern in the United States consists of three components in the following order: a first name, an optional middle name or middle initial, and a last name. However, in Hispanic cultures in the United States and abroad, the same number of components can yield a different underlying structure: a first name and two last names. It would be erroneous to assume that González in this example is a middle name.
In the example below, we can observe that the order of these components is not guaranteed across cultures, either. In Korean names, as in names from many other Asian cultures, it is common for the surname to precede the first name.
To avoid misinterpreting someone’s surname or first name, it is best to customize an appropriate algorithm for that culture’s naming conventions.
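To make the contrast concrete, the following Python sketch splits the same kind of token sequence under three different conventions. The convention labels, the function `split_name`, and the sample names are hypothetical, and the sketch assumes the convention is already known for each record, which in practice is a hard problem of its own.

```python
def split_name(full_name, convention):
    """Split a full name into components according to a naming convention."""
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError("expected at least two name components")
    if convention == "western":
        # first name, optional middle name(s) or initial, last name
        return {"first": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}
    if convention == "two_surnames":
        # first name followed by two last names; nothing is treated as a middle name
        return {"first": parts[0], "middle": "", "last": " ".join(parts[1:])}
    if convention == "surname_first":
        # surname precedes the first name
        return {"first": " ".join(parts[1:]), "middle": "", "last": parts[0]}
    raise ValueError(f"unknown convention: {convention}")


print(split_name("John Michael Doe", "western"))
print(split_name("Ana González Ruiz", "two_surnames"))
print(split_name("Kim Minjun", "surname_first"))
```

The same three-token shape yields a middle name under one convention and a second surname under another, which is why a single pattern applied to every record will misread some names.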
Detecting names automatically in PolyAnalyst
The challenges of name detection illustrated here represent only a small fraction of the issues you may encounter when designing your own name detection algorithm. Even more challenges can arise with messy data or when names conflict with other proper name entities (e.g., company names). Stay tuned for further blog posts on more advanced approaches to human name detection!