This document provides an introduction to text mining, including definitions of text mining and how it differs from data mining. It describes common areas and applications of text mining such as information retrieval, natural language processing, and information extraction. The document outlines the typical process of text mining including preprocessing, feature generation and selection, and different mining techniques. It also discusses common approaches to text mining such as keyword-based analysis and document classification/clustering. Finally, it notes some challenges of text mining related to unstructured text data.
2. CONTENTS
INTRODUCTION
DATA MINING vs TEXT MINING
AREAS OF TEXT MINING
INFORMATION RETRIEVAL
TEXT MINING PROCESS
TEXT MINING APPROACHES
CHALLENGES OF TEXT MINING
REFERENCES
3. INTRODUCTION
Text databases are growing rapidly because many sources now generate data
in text form.
These sources include collections of documents such as news articles,
research papers, books, digital libraries, e-mail messages, and the World
Wide Web (which can also be viewed as a huge, interconnected, dynamic text
database). Many government and business institutions also store their data
as text.
Understanding the patterns in this text and obtaining useful, reliable
information from it is the main motivation for text mining.
4. INTRODUCTION...(CONTD)
Text mining is formally defined as the process of extracting relevant
information or patterns from different sources that are in unstructured or
semi-structured format.
Data stored in most text databases is semi-structured, i.e., it is
neither completely unstructured nor completely structured.
For example, a document may contain a few structured fields, such as title,
authors, publication date, category, and so on, but also contain some
largely unstructured text components, such as abstract and contents.
5. DATA MINING vs TEXT MINING
DATA MINING | TEXT MINING
It is the process of finding patterns and extracting useful data from large data sets. | It is applied to data from various text documents.
Applied to all types of data. | Applied to text data, which is mostly semi-structured or unstructured.
Processing of data is done directly. | Processing of data is done linguistically.
Statistical techniques are used to evaluate data. | Computational linguistic principles are used to evaluate text.
6. AREAS OF TEXT MINING
IR (Information Retrieval): Query-based search on large text documents.
NLP (Natural Language Processing): Developing NLP applications is difficult
because computers generally expect humans to "speak" to them in a programming
language that is precise, unambiguous, and highly structured. Human speech is
rarely that precise; it depends on many complex variables, including slang,
social context, and regional dialects.
IE (Information Extraction): The automatic extraction of structured data, such
as entities, relationships between entities, and attributes describing
entities, from an unstructured source.
Data Mining: The extraction of useful data and hidden patterns from large data
sets. Data mining tools can predict behaviors and future trends, allowing
businesses to make better data-driven decisions.
7. INFORMATION RETRIEVAL
Information retrieval is a method to retrieve information from a large number
of text-based documents.
Due to the abundance of text information, information retrieval has found
many applications. There exist many information retrieval systems, such as:
-on-line library catalog systems,
-on-line document management systems, and
-the more recently developed Web search engines
A typical information retrieval problem is to locate relevant documents in a
document collection based on a user’s query, which is often some keywords
describing an information need.
8. INFORMATION RETRIEVAL…(CONTD)
1. BASIC MEASURES OF INFORMATION RETRIEVAL
There are two basic measures for assessing the quality of text retrieval:
Precision: This is the percentage of retrieved documents that are in fact
relevant to the query (i.e., “correct” responses). It is formally defined as
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: This is the percentage of documents that are relevant to the query and
were, in fact, retrieved. It is formally defined as
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
One commonly used trade-off is the F-score, which is defined as the harmonic
mean of recall and precision:
F-score = (recall × precision) / ((recall + precision) / 2)
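As a rough illustration of these measures, the sketch below computes precision, recall, and the F-score from two sets of document IDs. The function name and the sample IDs are made up for this example and do not come from the slides.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)          # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    mean = (precision + recall) / 2
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f_score = (precision * recall) / mean if mean else 0.0
    return precision, recall, f_score


# Hypothetical query result: 4 relevant documents exist, 3 were retrieved.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Here two of the three retrieved documents are relevant, so precision is 2/3, while only two of the four relevant documents were found, so recall is 1/2.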
9. INFORMATION RETRIEVAL…(CONTD)
2. TEXT RETRIEVAL METHODS
Information retrieval of text documents can be done by the following methods:
-Document selection method: In this method, the query specifies constraints
for selecting relevant documents. A typical method of this category is the
“Boolean retrieval model”, in which a document is represented by a set of
keywords and a user provides a Boolean expression of keywords,
e.g., “car and repair shops” or “tea or coffee”.
-Document ranking method: In this method, the query is used to rank all
documents in the order of relevance. The goal is to approximate the degree of
relevance of a document with a score computed based on information such as the
frequency of words in the document and the whole collection.
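The two methods above can be sketched on a toy collection as follows. The documents, function names, and the choice of TF-IDF as the ranking score are assumptions made for this example; the slides only say that the score is based on word frequency in the document and in the whole collection, and TF-IDF is one common way to do that.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
docs = {
    "d1": "car repair shops in the city",
    "d2": "tea and coffee shops",
    "d3": "car dealers and car repair",
}


def tokenize(text):
    return text.lower().split()


# Document selection: each document is represented by a set of keywords.
keyword_sets = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}


def boolean_and(*terms):
    """Documents containing every query term."""
    return sorted(d for d, kws in keyword_sets.items() if all(t in kws for t in terms))


def boolean_or(*terms):
    """Documents containing at least one query term."""
    return sorted(d for d, kws in keyword_sets.items() if any(t in kws for t in terms))


# Document ranking: score = sum over query terms of (term frequency * IDF).
term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def idf(term):
    df = sum(1 for counts in term_counts.values() if term in counts)
    return math.log(len(docs) / df) if df else 0.0


def rank(query):
    """All document IDs, most relevant first (ties broken by ID)."""
    terms = tokenize(query)
    scores = {d: sum(c[t] * idf(t) for t in terms) for d, c in term_counts.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))


and_hits = boolean_and("car", "repair")
or_hits = boolean_or("tea", "coffee")
ranking = rank("car repair")
print(and_hits, or_hits, ranking)
```

The Boolean queries return an unordered yes/no selection, whereas the ranking places d3 ahead of d1 because "car" occurs twice in d3.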
10. INFORMATION RETRIEVAL…(CONTD)
The first step in most retrieval systems is to identify keywords for
representing documents, a preprocessing step often called tokenization. To
avoid indexing useless words, a text retrieval system often associates a
“stop list” with a set of documents.
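A minimal sketch of tokenization with a stop list might look like this. The stop list and the regular expression are illustrative choices made for this example; real systems use much larger, collection-specific lists.

```python
import re

# Hypothetical, deliberately tiny stop list.
STOP_LIST = {"a", "an", "the", "of", "for", "with", "to", "is", "in", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text, stop_list=STOP_LIST):
    """Tokens that remain after removing stop-list words."""
    return [tok for tok in tokenize(text) if tok not in stop_list]


terms = index_terms("The first step is to identify keywords for the documents.")
print(terms)
```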
Text mining is a part of data mining.
11. TEXT MINING PROCESS
• Text preprocessing
-Syntactic/semantic text analysis (text cleanup, tokenization)
• Feature Generation
-Bag of words (the words a document contains and their occurrences)
-Vector space
• Feature Selection
-Simple counting
-Statistics
• Text/Data Mining
-Classification(supervised)
-Clustering(unsupervised)
-Associations(relationships)
• Analyzing results
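The first three stages of this process can be sketched on a toy corpus as follows. The sample sentences, the count threshold, and all names are assumptions made for this example; the resulting vectors are what a classification or clustering step would then take as input.

```python
import re
from collections import Counter

# Hypothetical three-document corpus.
docs = [
    "Text mining extracts patterns from text.",
    "Data mining extracts patterns from large data sets.",
    "Clustering groups similar documents.",
]


def preprocess(text):
    """Text cleanup and tokenization: lowercase, keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


# Feature generation: one bag of words (term -> occurrences) per document.
bags = [Counter(preprocess(doc)) for doc in docs]

# Feature selection by simple counting: keep terms seen at least twice overall.
total_counts = Counter()
for bag in bags:
    total_counts.update(bag)
vocabulary = sorted(term for term, count in total_counts.items() if count >= 2)

# Vector space: each document becomes a vector of counts over the vocabulary.
vectors = [[bag[term] for term in vocabulary] for bag in bags]

print(vocabulary)
for vec in vectors:
    print(vec)
```

With this threshold the third document maps to an all-zero vector, which shows how aggressive feature selection can discard everything a short document contains.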
12. TEXT MINING APPROACHES
The text mining approaches are based on the inputs taken in the text mining
system and the data mining tasks to be performed. In general, the major
approaches, based on the kinds of data they take as input, are:
(1) the keyword-based approach, where the input is a set of keywords or
terms in the documents,
(2) the tagging approach, where the input is a set of tags, and
(3) the information-extraction approach, which takes as input semantic
information, such as events, facts, or entities uncovered by information
extraction.
13. TEXT MINING APPROACHES…(CONTD)
1) KEYWORD-BASED ASSOCIATION ANALYSIS:
This analysis collects sets of keywords or terms that frequently occur
together and then finds the association or correlation relationships among them.
E.g. [Stanford, University]
2) DOCUMENT CLASSIFICATION ANALYSIS:
Automated document classification is an important text mining task because,
with the tremendous number of on-line documents, it is tedious yet
essential to be able to automatically organize such documents into classes to
facilitate document retrieval and subsequent analysis. E.g. Tagging
3) DOCUMENT CLUSTERING ANALYSIS:
Document clustering is one of the most crucial techniques for organizing
documents in an unsupervised manner.
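Keyword-based association analysis can be illustrated by counting how often pairs of keywords appear in the same document. The keyword sets, the support threshold, and the function name below are invented for this sketch; it counts pairs only, whereas a full association miner would also handle larger keyword sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets, one per document.
docs_keywords = [
    {"stanford", "university", "research"},
    {"stanford", "university", "campus"},
    {"university", "research"},
    {"stanford", "university"},
]

# Count how many documents contain each keyword and each keyword pair.
term_counts = Counter(k for keywords in docs_keywords for k in keywords)
pair_counts = Counter()
for keywords in docs_keywords:
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

# Keep pairs that co-occur in at least min_support documents.
min_support = 3
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}


def confidence(a, b):
    """Fraction of documents containing a that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / term_counts[a]


print(frequent_pairs)
print(confidence("stanford", "university"), confidence("university", "stanford"))
```

In this toy data every document mentioning "stanford" also mentions "university", but not the reverse, so the association is stronger in one direction.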
14. CHALLENGES OF TEXT MINING
Information is in unstructured textual form
Large textual database – Difficult to apply text mining
Complex and subtle relationships between concepts in text
Word ambiguity and context sensitivity
e.g., “windows” can mean either an operating system or openings in a wall that
allow air to flow into a house.
Noisy data: spelling mistakes and irrelevant data (outliers)
15. REFERENCES
[1] Jiawei Han (University of Illinois at Urbana-Champaign) and Micheline
Kamber, “Data Mining: Concepts and Techniques”, Second Edition.
[2] https://www.javatpoint.com/text-data-mining
[3] https://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf