Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user requires a language for writing queries. For the most basic use cases, the language operators could be regex or string search. But while regex and string search are indispensable for text mining, their utility hits a hard ceiling when semantic or meaningful search is required. They cannot, for example, capture complex entities such as human names, corporations, and drugs. For handling tasks like these, we need a more powerful query language that has semantic understanding, such as Megaputer’s PDL.
So, what is PDL, and how does it achieve semantic understanding while regex and string search do not? Let’s take a look at an example to find out.
First of all, PDL does not just search for the literal form of the word in the query: instead, it automatically extends its search to all morphological forms of the word. For example, when searching for the word “company” in financial news articles, the PDL query will not only find “company”, but also “companies,” the plural form. This feature often comes in handy, especially when the search involves a verb. Suppose that you are interested in extracting what the CEOs said. With regex or other substring search, you will need to list all possible verb forms such as “say,” “saying,” “says,” and “said.” With PDL, simply entering “say” in the query will automatically fetch all possible verb forms. This behavior can also be turned off by enclosing the word in the form function, which will then restrict the search to the literal form of the word, such as in the example below.
Another notable feature of the PDL language is its capability for users to tailor the scope of their searches using a range of built-in functions. Returning to the previous example, you may not wish to confine your search exclusively to the specific verb “say,” but rather include other synonymous verbs like “tell” or “mention.” Achieving this is straightforward with PDL – users can invoke the synonym function with the verb “say,” as demonstrated in (a) below. As the subsequent results table (b) illustrates, the captured text now includes various speech verbs such as “tell,” “emphasize,” and “claim,” in addition to the word “say,” capturing them in all possible verb forms. For additional flexibility, the user can also create and modify synonym dictionaries.
The PDL language offers various modes of information extraction, including proximity search (e.g., finding words A and B within a sentence, or within a 3-words range), syntactic relation (e.g., finding word A that is the subject or object of B), semantic relation (e.g., finding words that are synonyms/antonyms to word A), access to dictionaries and ontologies, and more. This language is expressive enough to capture complex patterns, and yet relatively easy to use, having a syntax that closely resembles English. Having access to this versatile query language significantly enhances the power and quality of text mining operations.
In conclusion, PDL is a powerful and versatile query language that enables users to extract meaningful information from text with greater efficiency and accuracy than competing methods like regex or string search. Its ability to understand and capture morphological forms, synonyms, and other complex patterns makes it an indispensable tool for solving text mining tasks that require semantic understanding. By leveraging the capabilities of PDL, users can enhance their information extraction processes and gain valuable insights from their data, making it a true Swiss army knife of information extraction.