The document provides an overview of text mining, including:
1. Text mining analyzes unstructured text data through techniques like information extraction, text categorization, clustering, and summarization.
2. It differs from regular data mining as it works with natural language text rather than structured databases.
3. Text mining has various applications including security, biomedicine, software, media, business and more. It faces challenges in representing meaning and context from unstructured text.
Text Mining
Submitted to:
Ms. Mala Kalra
Dr. Rakesh Kumar
Assistant Professor
Department of CSE
NITTTR Chandigarh
Submitted by:
Pankaj Thakur
MECSE (Modular)
RN 171408
2.
Contents
Introduction & Need
Information Retrieval and its Methods
Approaches
Process
Techniques used
Merits & Demerits
Challenges
Applications
Text Mining Computer Programs
Demo using Python
Latest Research work
References
Query
3.
Introduction
• Data means known facts that can be recorded and that have implicit meaning.[1]
• Database means a collection of related data. [1]
• Data Warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and that usually resides at a single site.[2]
• Data mining means knowledge mining from data, i.e. extracting knowledge from large amounts of data.[2]
• Text databases (document databases) are large collections of documents from various sources:
news articles, research papers, books, digital libraries, e-mail messages, web pages, etc.
(unstructured, semi-structured, structured)
• May be highly unstructured (some web pages on the WWW)
• May be semi-structured (e-mail messages)
• May be structured (library catalogue database)
• Text databases with highly regular structures typically can be implemented
using relational database systems.
• Text mining is the analysis of data contained in natural language text.
• Regular data mining vs. text mining: in text mining the patterns are extracted from
natural language text rather than from structured databases of facts.[3]
Diagram
4.
Text Mining vs. Data Mining
| | Data Mining | Text Mining |
| --- | --- | --- |
| Data object | Numerical & categorical data | Textual data |
| Data structure | Structured | Unstructured & semi-structured |
| Data representation | Straightforward | Complex |
| Space dimension | < tens of thousands | > tens of thousands |
| Methods | Data analysis, machine learning, statistics, neural networks | Data mining, information retrieval, NLP, ... |
| Maturity | Broad implementation since 1994 | Broad implementation starting 2000 |
| Market | 10^5 analysts at large and mid-size companies | 10^8 analysts: corporate workers and individual users |
5.
Need for Text Mining
• The amount of new information being created is massive and doubles every 18 months.
• 80-90% of all data is held in various unstructured formats.
• Useful information can be derived from this unstructured data.
Diagram: unstructured or semi-structured information (news articles, research papers, books, digital libraries, email messages, and web pages) versus structured, numerical or coded information.
• Text databases are growing rapidly due to the increasing amount of information available in
electronic form, such as electronic publications, various kinds of electronic documents, emails,
and the WWW.
• Most of the information in government, industry, business, and other institutions is stored
electronically in the form of text databases.
6.
Information Retrieval[2]
Information retrieval (IR) is a field that has been developing in parallel with database systems.
It is concerned with the retrieval of information from a large number of text-based documents.
Precision and recall are two basic measures for assessing the quality of text retrieval.
Precision is the percentage of retrieved documents
that are in fact relevant to the query:
$$\mathrm{precision} = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|}$$
Recall is the percentage of documents that are relevant
to the query and were, in fact, retrieved:
$$\mathrm{recall} = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|}$$
where {Relevant} is the set of documents relevant to a query and
{Retrieved} is the set of documents retrieved.
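The two measures can be sketched in a few lines of Python using set intersection. The function name `precision_recall` and the document ids below are assumptions made up for illustration; they do not come from the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 5 retrieved, 3 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d4", "d7", "d9"}
p, r = precision_recall(relevant, retrieved)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.60, recall = 0.75
```

Returning 0.0 for an empty set is a choice made here to avoid division by zero; other conventions are possible.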
7.
Information Retrieval Methods[2]
IR methods fall into two categories:
• Document selection methods (document selection problem)
e.g. Boolean retrieval model
• Document ranking methods (document ranking problem)
e.g. Vector space model
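As a rough illustration of document selection, the Boolean retrieval model can be sketched with an index that maps each term to the set of documents containing it, so that AND, OR and NOT become set operations. The toy collection and the helper `build_index` are hypothetical and only serve this sketch.

```python
from collections import defaultdict

# Hypothetical toy collection; ids and texts are made up for illustration.
docs = {
    "d1": "text mining extracts patterns from text",
    "d2": "data mining works on structured databases",
    "d3": "information retrieval ranks text documents",
}


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


index = build_index(docs)

# Boolean queries become set operations on the per-term document sets.
text_and_mining = index["text"] & index["mining"]  # text AND mining
text_or_data = index["text"] | index["data"]       # text OR data
mining_not_text = index["mining"] - index["text"]  # mining AND NOT text

print(sorted(text_and_mining))  # ['d1']
print(sorted(text_or_data))     # ['d1', 'd2', 'd3']
print(sorted(mining_not_text))  # ['d2']
```

A document is either selected or not; there is no ordering among the selected documents, which is what ranking methods add.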
8.
Vector Space Model[2]
• Represent both a document and a query as vectors in a high-dimensional space
corresponding to all the keywords, and use an appropriate similarity measure to
compute the similarity between the query vector and the document vector.
• The similarity values can then be used for ranking documents.
• Let freq(d, t) = term frequency = no. of occurrences of term t in the document d
• TF(d, t) = term frequency matrix, measures the association of a term t with respect
to the given document d.
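• Term frequency:
$$TF(d, t) = \begin{cases} 0 & \text{if } \mathrm{freq}(d, t) = 0 \\ 1 + \log(1 + \log(\mathrm{freq}(d, t))) & \text{otherwise} \end{cases}$$
• Inverse document frequency (represents the scaling factor, or the importance, of term t):
$$IDF(t) = \log\frac{1 + |d|}{|d_t|}$$
Here, d is the document collection and
d_t is the set of documents containing term t.
• Combined measure:
$$TF\text{-}IDF(d, t) = TF(d, t) \times IDF(t)$$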
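The slides leave the similarity measure open. One common choice is cosine similarity, assumed here for illustration; the weight vectors below are made up, not taken from the deck. The sketch ranks three hypothetical documents against a query.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term weights over three keywords; the values are made up.
documents = {
    "d1": [1.0, 0.0, 2.0],
    "d2": [0.0, 3.0, 0.0],
    "d3": [2.0, 1.0, 2.0],
}
query = [1.0, 0.0, 1.0]

# Rank documents by decreasing similarity to the query.
ranking = sorted(
    documents,
    key=lambda doc_id: cosine_similarity(query, documents[doc_id]),
    reverse=True,
)
print(ranking)  # ['d1', 'd3', 'd2']
```

In practice the vector entries would be term weights such as TF-IDF values rather than hand-picked numbers.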
9.
Vector Space Model[2]
(Example)
d/t t1 t2 t3 t4 t5 t6 t7
d1 0 4 10 8 0 5 0
d2 5 19 7 16 0 0 32
d3 15 0 0 4 9 0 17
d4 22 3 12 0 5 15 0
d5 0 7 0 9 2 4 12
A Term Frequency Matrix
For t6 in d4 we have (logarithms to base 10):
TF(d4, t6) = 1 + log(1 + log(15)) = 1.3377
IDF(t6) = log((1 + 5)/3) = 0.301
TF-IDF(d4, t6) = 1.3377 × 0.301 = 0.403
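The worked example for t6 in d4 can be checked with a short Python sketch. It assumes base-10 logarithms, which is what the value 0.301 for log(2) implies; the function names `tf`, `idf` and `tf_idf` are chosen here for illustration.

```python
import math

# Term frequency matrix from the example: rows d1..d5, columns t1..t7.
freq = [
    [0, 4, 10, 8, 0, 5, 0],
    [5, 19, 7, 16, 0, 0, 32],
    [15, 0, 0, 4, 9, 0, 17],
    [22, 3, 12, 0, 5, 15, 0],
    [0, 7, 0, 9, 2, 4, 12],
]


def tf(count):
    """Damped term frequency: 0 for absent terms, else 1 + log(1 + log(count))."""
    return 0.0 if count == 0 else 1 + math.log10(1 + math.log10(count))


def idf(term_index):
    """log((1 + |d|) / |d_t|), where d_t is the set of documents containing the term."""
    containing = sum(1 for row in freq if row[term_index] > 0)
    return math.log10((1 + len(freq)) / containing)


def tf_idf(doc_index, term_index):
    """Product of the damped term frequency and the inverse document frequency."""
    return tf(freq[doc_index][term_index]) * idf(term_index)


# d4 is row 3 and t6 is column 5 (zero-based indices).
print(round(tf(freq[3][5]), 4))  # 1.3377
print(round(idf(5), 3))          # 0.301
print(round(tf_idf(3, 5), 3))    # 0.403
```

The `idf` helper assumes every term occurs in at least one document, which holds for this matrix.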
10.
Text Mining Approaches[2]
Text mining approaches, by input:
• Keyword-based approach
Input: a set of keywords or terms in the documents
• may only discover relationships,
e.g. “database” & “system”,
“terrorist” & “explosion”
• may not bring deep
understanding of the text
• Tagging approach
Input: a set of tags
• may rely on
manual tagging
(costly & not feasible
for large collections of
documents)
• Information extraction approach
Input: semantic information
(events, facts, etc.)
• more advanced
• may lead to the discovery of
some deep knowledge
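The kind of relationship a keyword-based approach can find is essentially co-occurrence. A minimal sketch, assuming each document has already been reduced to a set of keywords (the keyword sets below are hypothetical), counts how often each pair of keywords appears together:

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents, each already reduced to a set of keywords.
documents = [
    {"database", "system", "query"},
    {"database", "system", "index"},
    {"terrorist", "explosion", "news"},
    {"database", "query"},
]

pair_counts = Counter()
for keywords in documents:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

# Show the most frequent keyword pairs.
for pair, count in pair_counts.most_common(2):
    print(pair, count)
```

The counts say that two terms tend to appear together, but nothing about why, which is the shallow understanding the slide refers to.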
11.
Text Mining Process[4]
Text documents from different sources → Preprocessing → Text mining technique is applied → Analysis of text → Discovery of knowledge
Technologies such as information extraction, categorization, clustering, visualization and summarization
are used in the text mining process.
12.
Techniques Used in Text Mining[4]
1. Information extraction:
tokenization, identification of named entities, sentence segmentation, and part-of-speech assignment.
2. Text categorization:
the procedure of assigning a category to a text from among categories predefined by users.
3. Text clustering:
the procedure of segmenting texts into several clusters, depending on their substantial
relevance.
4. Visualization:
improves and simplifies the discovery of relevant information.
5. Text summarization:
the procedure of automatically extracting partial content of a text that reflects its whole content.
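Two of the information extraction steps named above, sentence segmentation and tokenization, can be sketched with regular expressions from the Python standard library. This is a deliberately naive sketch: the rules are assumptions chosen for brevity and will mis-split abbreviations such as "Dr."; real systems use trained or rule-rich tokenizers.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenization: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


text = "Text mining analyzes natural language text. It differs from regular data mining!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
# ['text', 'mining', 'analyzes', 'natural', 'language', 'text']
# ['it', 'differs', 'from', 'regular', 'data', 'mining']
```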
13.
Merits and Demerits of Text Mining[4]
Merits:
i) The names of different entities and the relationships between them can easily be
found in a corpus of documents (using techniques such as
information extraction).
ii) Text mining solves the challenging problem of managing a great amount of unstructured
information in order to extract patterns.
Demerits:
i) The information that is initially needed is nowhere written down.
ii) No program can be made to analyze unstructured text directly in order to mine it
for information or knowledge.
14.
Challenges in Text Mining
(Representation issues)
• Each word has a dictionary meaning, or meanings
Run – (1) the verb. (2) the noun, in cricket
Cricket – (1) The game. (2) The insect.
Apple (the company) or apple (the fruit)
• Ambiguity and context sensitivity: each word is used in various “senses”
Tendulkar made 100 runs
Because of an injury, Tendulkar cannot run and will need a runner between the
wickets
• Capturing the “meaning” of sentences is an important issue as well.
(Grammar, parts of speech, time sense could be easy!)
• Order of words in the query
hot dog stand in the amusement park
hot amusement stand in the dog park
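The word-order problem shows up clearly in a bag-of-words representation, which keeps word counts but discards order. In the minimal sketch below (the helper `bag_of_words` is a name chosen for illustration), the two queries above become indistinguishable:

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs, ignoring word order."""
    return Counter(text.lower().split())


q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

print(bag_of_words(q1) == bag_of_words(q2))  # True: same words, same counts
print(q1.split() == q2.split())              # False: the word order differs
```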
15.
Text Mining Applications[5]
1. Security applications
(monitoring and analysis of online plain text sources such as Internet news, blogs, etc.
for national security purposes.)
2. Biomedical applications
(studies in protein docking, protein interactions, and protein-disease associations)
3. Software applications
(Within the public sector, much effort has been concentrated on creating software for
tracking and monitoring terrorist activities.)
4. Online media applications
(The Tribune Company uses text mining to clarify information and to provide readers
with greater search experiences, which in turn increases site "stickiness" and revenue.)
5. Business and marketing applications
(CRM, to improve predictive analytics models for customer, stock returns prediction)
6. Sentiment analysis
(analysis of movie reviews, used to detect emotions, etc.)
7. Scientific literature mining and academic applications
Data Warehouse
Diagram: data sources in Delhi, Mumbai, Kolkata and Chennai are cleaned, integrated, transformed, loaded and refreshed into the data warehouse, which clients access through query and analysis tools.