Since the dawn of mankind, people have stored their knowledge in the form of textual documents. The rediscovery of ancient Egyptian civilization began in 1799, when soldiers of Napoleon Bonaparte’s army, while constructing a fort during their Egyptian expedition, found a stone pillar bearing an inscription in three scripts. Two were different forms of Egyptian writing, and the third was Greek, providing the key to deciphering ancient Egyptian hieroglyphics. This now famous stone is known as the Rosetta stone, after the name of a town near the fort. It helped uncover the knowledge residing in ancient Egyptian texts found on the walls of temples and pyramids and on recovered papyrus scrolls, and led to a long chain of truly remarkable discoveries. The Rosetta stone became the cornerstone of our present understanding of the history of ancient Egypt. There are two lessons we can learn from this. The first is that textual documents are one of mankind’s primary memory mechanisms: it is important to store data reliably and to be able to retrieve the proper documents and fragments when necessary. The second is that simply having access to documents is not enough: it takes special skills and effort to uncover the knowledge buried in available data.
Today’s world is both driven by data and overwhelmed with data. Utilizing knowledge derived from data results in more informed and profitable business decisions. At the same time, the volume of stored data is growing at an exponential rate, and traditional storage and analysis methods fail to meet the challenge of coping with this data. New technologies for more efficient data storage, access, and analysis are emerging.
The two most important data formats are numeric and textual data (audio and visual data are used occasionally for special tasks). Point-of-sale and on-line transactional systems, government and legislative materials, corporate documentation, business correspondence, patient records, and of course the Web are among the main sources of data for analysis. Numeric data are stored primarily for the purpose of analysis by governmental and scientific organizations and by various departments of larger corporations. Methods for automated processing of numeric data have been developed over decades, powerful tools for computer analysis of numeric data are available, and many monographs are dedicated to the subject. Yet the bulk of useful data is available in the form of text. Written text is the most widespread and reliable way to capture and store, for future reference, knowledge expressed in natural language. It is also one of the main means of delivering information to remote recipients. Letters, books, newspapers, orders, records, memoranda, manuals, instructions, and correspondence are all examples of textual documents that we deal with every day. The challenge now is to quickly and reliably find, retrieve, and manage knowledge dispersed throughout enormous volumes of available data. This challenge is very serious, because the amount of data requiring analysis is growing at an increasing rate, and very urgent, because today the success of an organization critically depends on its ability to utilize knowledge about its own operation and about the world around it. The Rosetta stone of today is represented by technologies for the retrieval and management of knowledge contained in documents.
From government and legislative organizations to corporations and universities, and from journalists and writers to college students, we all create, store, retrieve, and analyze texts. Correspondingly, numerous organizations face a variety of document warehousing and text analysis tasks. Consider a few simple examples:
- Lawyers, insurers, and venture capitalists often have to quickly grasp the meaning of cases, claims, and proposals, respectively. They need to improve the quality of querying the Web and diverse databases to find and retrieve relevant documents. Their practice could benefit tremendously from automated text summarization and feature extraction, in which key points from the text are organized into a database of meta-information that improves future access to the knowledge contained in documents.
- Call center specialists have to understand customer support questions, quickly select the relevant documents from among available manuals, lists of frequently asked questions, and engineering notes, and retrieve those bits of knowledge that help answer the question. An automated system for categorizing available materials and retrieving the fragments most relevant to natural language questions could save hundreds of thousands of man-hours and dramatically reduce response time. Identifying the best fragments with the help of thesauruses and ontologies could significantly improve recall, or the thoroughness of the search.
- Internet search engines could deliver much better results by accepting and making sense of natural language queries. If the documents found in response to a query were analyzed semantically for their relevance in the context of the original query, the precision of the search could increase significantly: instead of an overwhelming list of more than 10,000 documents, the system could provide you with a short list of the most relevant ones.
Organizations are miniature copies of human society. The way they capture, store, and recall acquired knowledge mimics the way this is done elsewhere. Textual documents are the primary source of knowledge distributed throughout an organization and beyond. Internal documents provide organizations with memory, because knowledge dispersed in these documents can be retrieved and utilized later. External documents provide organizations with eyes and ears, enabling them to access knowledge stored elsewhere. And outgoing documents provide organizations with the means to deliver their message to the outside world. Automating document management and analysis tasks, and improving the quality and intelligence of the solutions employed, would provide organizations with an enormous new competitive advantage.
Dan Sullivan’s book, which you are about to read, is dedicated to working with knowledge represented in the form of free-form text. If you are facing a text analysis task, this book is for you: it provides answers to numerous questions about document warehousing and semantic text analysis. The book is the first attempt to organize the wealth of new material in these emerging areas in a logical, well-structured way. It discusses the best-known practices and technologies for storing, managing, retrieving, and analyzing textual documents to serve the needs of a modern organization. And for those readers with an immediate need to implement a knowledge management system at their company, the book outlines the best available software tools for document warehousing and text mining. I strongly believe that you will enjoy reading and learning from this book as much as I did.