Data mining is the process of extracting patterns from data, and it is becoming an increasingly important tool for transforming data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. Data mining can be applied to a variety of data types, including structured (relational) data, multimedia data, free text, and hypertext.
Nowadays, text is the most common and convenient medium for information exchange, since much of the world’s data is contained in text documents (newspaper articles, emails, literature, web pages, etc.). Its importance has led many researchers to look for suitable methods of analyzing natural language texts in order to extract important and useful information. In contrast to data stored in a structured format (databases), text stored in documents is unstructured, and dealing with such data requires a preprocessing step that transforms the textual data into a format suitable for automatic processing.
Text mining is a new and exciting area of computer science research that aims to solve the problem of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. Text mining, also known as text data mining or knowledge discovery from textual databases, generally refers to the automatic process of extracting interesting, high-quality information or knowledge from unstructured text documents by using a suite of analysis tools.
Text mining takes much of its inspiration and direction from core research on data mining. As a result, text mining and data mining systems share many high-level architectural similarities. For example, both depend on preprocessing routines, pattern-discovery algorithms, and presentation-layer elements. Furthermore, in its core knowledge discovery operations, text mining adopts many of the specific types of patterns that were first introduced and vetted in data mining research.
The difference between data mining and text mining lies in the specific stages of data preparation and in the difficulty of finding important patterns, which stems from the semi-structured or unstructured nature of the textual documents being processed.
Data mining systems assume that data have already been stored in a structured format. Therefore, their preprocessing stage focuses on two critical tasks: scrubbing and normalizing data, and creating extensive numbers of table joins. In contrast, preprocessing in text mining systems focuses on identifying and extracting representative features from natural language documents. These preprocessing tasks are responsible for transforming the unstructured, original-format content of document collections into a more explicitly structured intermediate format, a concern that is not relevant for most data mining systems. Text mining preprocessing draws on a variety of techniques culled and adapted from information retrieval, information extraction, and computational linguistics research, such as tokenization, stop word removal, normalization, and stemming.
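The preprocessing steps named above (tokenization, stop word removal, normalization, and stemming) can be illustrated with a minimal sketch. The stop word list, the suffix list, and all function names below are hypothetical choices made for illustration only. In particular, the suffix stripper is a deliberately crude stand-in for a real stemming algorithm and is not taken from the source.

```python
import re

# Hypothetical, deliberately tiny stop word list; real systems use far larger ones.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "in", "to", "on"}

# Hypothetical suffix list for a crude stemmer (an assumption, not a standard algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Normalize to lowercase and split the text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop very frequent words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the whole pipeline: tokenize, filter stop words, then stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The documents are clustered and classified in groups."))
# prints ['document', 'cluster', 'classifi', 'group']
```

The output is a list of normalized terms, which is one possible form of the structured intermediate format that later text mining stages operate on. A truncated stem such as `classifi` shows that stems need not be dictionary words; they only need to map related word forms to the same term.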
Typical text mining tasks include text extraction and representation, information retrieval, document summarization, document clustering, and document classification:
- Text representation is concerned with how to represent text data in a format appropriate for automatic processing. In general, documents can be represented in two ways: as a bag of words, in which context and word order are ignored, or as a set of common phrases found in the text, each of which is treated as a single term.
- In information retrieval, the information need is expressed as a query, and the task of the information retrieval system is to find and return the documents that contain the information most relevant to that query. To this end, text mining techniques are used to analyse the text data and compare the extracted information with the given query in order to find the documents that contain answers.
- Text summarization aims to automatically detect the most important phrases in a given text document and to create a condensed version of the input text for human use. It can be done for a single document or for a document collection (multi-document summarization). Most approaches in this area focus on extracting informative sentences from texts and building summaries from the extracted information. Recently, many approaches have tried to create summaries based on semantic information extracted from the given text documents.
- Document clustering is a machine learning technique that groups text documents according to the similarity of their content. Unlike document classification, document clustering is an unsupervised method in which there are no pre-defined categories. The idea of document clustering is to create links between similar documents in a document collection so that they can be retrieved together.
- Document classification is the assignment of text documents to one or more pre-defined categories based on their content. It is a supervised learning problem in which the categories are known in advance. Many machine learning techniques, including decision trees, k-nearest neighbor, support vector machines (SVM), and the Naive Bayes algorithm, have been used to build document classification models.
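Several of the tasks listed above can be sketched on top of the bag-of-words representation. The following is a minimal illustration, not a definitive implementation: it assumes whitespace tokenization, raw term counts, and cosine similarity as the comparison measure, none of which is prescribed by the text. All function names and sample documents are hypothetical. Retrieval is shown as ranking documents against a query, and classification as a one-nearest-neighbor lookup over labelled examples.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated terms; word order is discarded."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Information retrieval: (index, score) pairs, most similar document first."""
    q = bag_of_words(query)
    scores = [(i, cosine_similarity(q, bag_of_words(d)))
              for i, d in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def classify_nearest(text, labelled_docs):
    """Classification: return the label of the most similar labelled document."""
    v = bag_of_words(text)
    best = max(labelled_docs,
               key=lambda item: cosine_similarity(v, bag_of_words(item[0])))
    return best[1]


# Hypothetical sample data, invented for this sketch.
docs = [
    "fraud detection in bank transactions",
    "clustering of news articles",
    "summarization of news articles for readers",
]
print([index for index, _ in rank_documents("news clustering", docs)])
# prints [1, 2, 0]

labelled = [
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "ham"),
]
print(classify_nearest("buy cheap pills", labelled))
# prints spam
```

The same similarity function could also drive document clustering, by grouping documents whose pairwise similarity is high without using any labels. That step is left out here to keep the sketch short.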