PolyAnalyst comes with several different default dictionaries such as a morphology dictionary, a synonym dictionary, and a dictionary of human names. Those generic dictionaries cover general terms that are useful for a variety of fields. However, when you work with a domain-specific dataset, such as a medical dataset or a car repair dataset, it is crucial to have domain-specific dictionaries that include related terms.
Why do you need domain-specific dictionaries?
As you may know, any query may introduce false positives as well as false negatives. For example, you may have the following note from your car repair dataset; the query for answering “Which car component has what issues?” matches the highlighted parts.
You notice that there is one false positive, “noise at highway speeds”, where “highway speeds” is incorrectly identified as a car part. You also notice that there are two false negatives, “front wipers streaking glass” and “misfire in cylinder 6”, which the query fails to catch. How can you fix these errors? You may change your query, but you will very likely find more false positives and false negatives as you continue to check your results. Do you want to keep modifying your queries again and again?
Instead of repeatedly modifying your queries to exclude false positives and capture the false negatives you discover, a more efficient way to get the results you want is to use dictionaries. For example, you could compile a dictionary of relevant car part terms so that the query catches mentions it previously missed (the false negatives). Then, by generating a stoplist, you can eliminate false positives, that is, terms such as “highway speeds” that are not car parts. Besides giving you reusable dictionaries, this approach lets you write queries that are clearer and easier to understand.
Using dictionaries in your queries can greatly increase the accuracy of your results. As the coverage of your dictionaries increases, the coverage of your results is very likely to increase as well. For instance, if you are searching for patterns that identify a car part and its issues, you could add known car parts to the dictionary to return reliable results or to discover new ways in which an issue is described. Specifically, if you add “wiper” and “cylinder” to your car part dictionary, the query will automatically match phrases like “wiper malfunction”, “cylinder malfunction”, “faulty wiper”, “faulty cylinder”, etc.
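The idea of dictionary-driven matching can be sketched outside PolyAnalyst in a few lines of Python. This is only an illustration, not PolyAnalyst's query language: the names CAR_PARTS, ISSUE_WORDS, normalize, and find_part_issues are hypothetical, the term lists are made up for the example, and the crude plural stripping merely stands in for a real morphology dictionary.

```python
import re

# Hypothetical dictionaries for illustration; real ones would be much larger.
CAR_PARTS = {"wiper", "cylinder", "brake"}
ISSUE_WORDS = {"malfunction", "faulty", "misfire", "streaking", "noise"}

def normalize(token):
    """Strip a trailing plural 's' (a crude stand-in for morphology handling)."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def find_part_issues(note, parts=CAR_PARTS, issues=ISSUE_WORDS):
    """Return (part, issue) pairs that occur in the same clause of a note."""
    pairs = []
    # Split the note into clauses on commas, semicolons, and periods.
    for clause in re.split(r"[,;.]", note.lower()):
        tokens = [normalize(t) for t in re.findall(r"[a-z]+", clause)]
        found_parts = [t for t in tokens if t in parts]
        found_issues = [t for t in tokens if t in issues]
        pairs.extend((p, i) for p in found_parts for i in found_issues)
    return pairs

note = "Front wipers streaking glass, misfire in cylinder 6, noise at highway speeds"
print(find_part_issues(note))
```

Because a part is reported only when it appears in the dictionary, “highway speeds” is never treated as a car part in this sketch, while adding a single term such as “wiper” covers every phrasing that mentions it.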
How do you build such a dictionary?
You can easily build such dictionaries with the help of PolyAnalyst. Using the Entity Extraction node along with the Keyword Extraction node, you can extract terms of interest. Once you have extracted the list of terms with the desired patterns, what should you do next? Is every term in the list what you want? In reality, you will often get junk terms in your list due to the complexity of human language. So the question is: how do you get rid of them?
Luckily, PolyAnalyst allows users to validate their extraction results, so they can ensure that the terms being extracted and used in subsequent dictionaries are relevant and accurate.
An example of building a Medical Device Dictionary:
For instance, if you have a medical device repair dataset, you can use this method to generate a dictionary of device components. Since components are nouns, you could use PolyAnalyst's extraction nodes to generate a list of frequently used nouns. Once you have the list of possible components, you can validate them and then export the ones that are marked as “Valid” to your device component dictionary. In this way, you obtain a clean dictionary for your analysis.
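The extract, validate, and export workflow can be pictured with a minimal Python sketch. It does not use PolyAnalyst: the function names candidate_terms and export_valid, the min_count threshold, the sample nouns, and the review marks are all assumptions made up for illustration, and the list of nouns is assumed to have been produced already by some extraction step.

```python
from collections import Counter

def candidate_terms(extracted_nouns, min_count=2):
    """Return nouns seen at least min_count times, most frequent first."""
    counts = Counter(noun.lower() for noun in extracted_nouns)
    return [term for term, count in counts.most_common() if count >= min_count]

def export_valid(candidates, review):
    """Keep only the candidates that a reviewer marked as 'Valid'."""
    return sorted(term for term in candidates if review.get(term) == "Valid")

# Hypothetical nouns extracted from medical device repair notes.
nouns = ["pump", "Pump", "battery", "battery", "thing", "thing", "sensor"]
candidates = candidate_terms(nouns)

# Hypothetical manual review: junk terms are marked as something other than "Valid".
review = {"pump": "Valid", "battery": "Valid", "thing": "Invalid"}
device_components = export_valid(candidates, review)
print(device_components)
```

Keeping the review marks separate from the candidate list means the same candidates can be re-reviewed later without re-running the extraction.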
Below is an example of validating medical device components using the Entity Extraction node with customized queries.
Now you can store these domain-specific nouns in a dictionary. When you receive more medical device repair data, the same queries that use this dictionary will automatically cover the new records as well. There is no need to recompile the dictionary for new data.
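As a rough illustration of this reuse, the sketch below saves a dictionary once and applies it unchanged to a later batch of notes. It is plain Python rather than PolyAnalyst: the JSON file format, the helper names save_dictionary, load_dictionary, and matching_notes, and the sample notes are hypothetical, and the naive substring match is an assumption chosen for brevity.

```python
import json
import os
import tempfile

def save_dictionary(terms, path):
    """Store the dictionary terms as a sorted JSON list."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(terms), f)

def load_dictionary(path):
    """Load the dictionary terms back into a set."""
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))

def matching_notes(notes, dictionary):
    """Return notes that mention at least one term (naive substring match)."""
    return [n for n in notes if any(term in n.lower() for term in dictionary)]

# Save the validated dictionary once (a temporary directory is used here).
path = os.path.join(tempfile.mkdtemp(), "device_components.json")
save_dictionary({"pump", "battery"}, path)

# Later, a new batch of notes arrives; the stored dictionary is reused as-is.
reloaded = load_dictionary(path)
new_notes = ["Replaced battery pack", "Customer called twice", "Pumps leaking"]
hits = matching_notes(new_notes, reloaded)
print(hits)
```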
In brief, domain-specific dictionaries help data analysts maintain simpler queries and get better coverage in an efficient way.