Text Mining
Submitted to:
Ms. Mala Kalra
Dr. Rakesh Kumar
Assistant Professor
Department of CSE
NITTTR Chandigarh
Submitted by:
Pankaj Thakur
MECSE (Modular)
RN 171408
Contents
• Introduction & Need
• Information Retrieval and its Methods
• Approaches
• Process
• Techniques used
• Merits & Demerits
• Challenges
• Applications
• Text Mining Computer Programs
• Demo using Python
• Latest Research work
• References
• Query
Introduction
• Data means known facts that can be recorded and that have implicit meaning.[1]
• Database means a collection of related data. [1]
• Data Warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and that usually resides at a single site.[2]
• Data Mining means knowledge mining from data, i.e. extracting knowledge from large
amounts of data.[2]
• Text databases (document databases) are large collections of documents from various sources:
news articles, research papers, books, digital libraries, e-mail messages, web pages, etc.
• The data in a text database may be highly unstructured (some web pages on the WWW),
semi-structured (e-mail messages), or structured (a library catalogue database).
• Text databases with highly regular structures can typically be implemented
using relational database systems.
• Text Mining is the analysis of data contained in natural language text.
• Regular data mining vs. text mining: in text mining the patterns are extracted from
natural language text rather than from structured databases of facts.[3]
Text Mining Vs. Data Mining
                      Data Mining                          Text Mining
Data object           Numerical & categorical data         Textual data
Data structure        Structured                           Unstructured & semi-structured
Data representation   Straightforward                      Complex
Space dimension       < tens of thousands                  > tens of thousands
Methods               Data analysis, machine learning,     Data mining, information
                      statistics, neural networks          retrieval, NLP, ...
Maturity              Broad implementation since 1994      Broad implementation starting 2000
Market                10^5 analysts at large and           10^8 analysts: corporate workers
                      mid-size companies                   and individual users
Need of Text Mining
• The amount of new information being created doubles every 18 months.
• 80-90% of all data is held in various unstructured formats.
• Useful information can be derived from this unstructured data.
[Figure: unstructured or semi-structured information (news articles, research papers, books,
digital libraries, e-mail messages, and web pages) compared with structured, numerical or coded
information]
• Text databases are growing rapidly due to the increasing amount of information available in
electronic form, such as electronic publications, various kinds of electronic documents, e-mails,
and the WWW.
• Most of the information in government, industry, business, and other institutions is stored
electronically in the form of text databases.
Information Retrieval[2]
Information retrieval (IR) is a field that has been developing in parallel with database systems.
It is concerned with the retrieval of information from a large number of text-based documents.
Precision and recall are two basic measures for assessing the quality of text retrieval.
Precision is the percentage of retrieved documents
that are in fact relevant to the query:
$$\text{precision} = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|}$$
Recall is the percentage of documents that are relevant
to the query and were, in fact, retrieved:
$$\text{recall} = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|}$$
where {Relevant} is the set of documents relevant to a query and
{Retrieved} is the set of documents retrieved.
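The two measures can be illustrated with a minimal Python sketch. The document ids and the function name `precision_recall` are hypothetical, chosen only for this illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant, retrieved)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Here two of the three retrieved documents are relevant (precision 2/3), while two of the four relevant documents were found (recall 1/2).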
Information Retrieval Methods[2]
IR methods fall into two categories:
• Document selection methods
  treat retrieval as a document selection problem, e.g. the Boolean retrieval model
• Document ranking methods
  treat retrieval as a document ranking problem, e.g. the vector space model
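As a rough illustration of document selection, the sketch below evaluates a Boolean query with set operations over an inverted index. The three documents and the query are made up for the example, and real Boolean retrieval systems are considerably more elaborate.

```python
# Hypothetical mini-collection used only for illustration.
docs = {
    "d1": "text mining extracts patterns from natural language text",
    "d2": "a data warehouse stores data under a unified schema",
    "d3": "data mining extracts knowledge from large amounts of data",
}

# Inverted index: term -> set of ids of the documents containing that term.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def lookup(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


# Boolean query: data AND mining AND NOT warehouse
selected = (lookup("data") & lookup("mining")) - lookup("warehouse")
print(sorted(selected))
```

A document is either selected or not; no ordering by relevance is produced, which is what distinguishes this from the ranking methods.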
Vector Space Model[2]
• Represent both a document and a query as vectors in a high-dimensional space
corresponding to all the keywords, and use an appropriate similarity measure to
compute the similarity between the query vector and the document vector.
• The similarity values can then be used for ranking documents.
• Let freq(d, t) = term frequency = number of occurrences of term t in the document d.
• TF(d, t) = entry of the term frequency matrix; it measures the association of a term t with
respect to the given document d:
$$TF(d, t) = \begin{cases} 0 & \text{if } freq(d, t) = 0 \\ 1 + \log(1 + \log(freq(d, t))) & \text{otherwise} \end{cases}$$
• IDF(t) = inverse document frequency; it represents the scaling factor, or the importance, of term t:
$$IDF(t) = \log \frac{1 + |d|}{|d_t|}$$
Here, d is the document collection and d_t is the set of documents containing term t.
• TF-IDF(d, t) = TF(d, t) × IDF(t)
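The slide leaves the similarity measure open. A common choice is cosine similarity, sketched below under that assumption; the term weights, the four-keyword vocabulary, and the document ids are hypothetical.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm1 * norm2)


# Hypothetical term weights over a 4-keyword vocabulary.
documents = {
    "d1": [1.0, 0.0, 2.0, 0.0],
    "d2": [0.0, 3.0, 0.0, 1.0],
    "d3": [1.0, 1.0, 1.0, 0.0],
}
query = [1.0, 0.0, 1.0, 0.0]

# Rank documents by decreasing similarity to the query vector.
ranking = sorted(
    documents,
    key=lambda doc_id: cosine_similarity(query, documents[doc_id]),
    reverse=True,
)
print(ranking)
```

In practice the vector entries would be TF-IDF weights rather than hand-picked numbers.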
Vector Space Model[2]
(Example)
d/t t1 t2 t3 t4 t5 t6 t7
d1 0 4 10 8 0 5 0
d2 5 19 7 16 0 0 32
d3 15 0 0 4 9 0 17
d4 22 3 12 0 5 15 0
d5 0 7 0 9 2 4 12
A Term Frequency Matrix
For t6 in d4 we have (logarithms taken to base 10):
TF(d4, t6) = 1 + log(1 + log(15)) = 1.3377
IDF(t6) = log((1 + 5) / 3) = 0.301
TF-IDF(d4, t6) = 1.3377 × 0.301 = 0.403
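The worked example can be checked with a short Python sketch over the same term frequency matrix. It assumes base-10 logarithms, which reproduce the figures on the slide; the function names `tf`, `idf`, and `tf_idf` are illustrative.

```python
import math

terms = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
# Term frequency matrix from the slide: one row of counts per document.
freq = {
    "d1": [0, 4, 10, 8, 0, 5, 0],
    "d2": [5, 19, 7, 16, 0, 0, 32],
    "d3": [15, 0, 0, 4, 9, 0, 17],
    "d4": [22, 3, 12, 0, 5, 15, 0],
    "d5": [0, 7, 0, 9, 2, 4, 12],
}


def tf(count):
    """Damped term frequency: 0 for absent terms, 1 + log(1 + log(count)) otherwise."""
    if count == 0:
        return 0.0
    return 1 + math.log10(1 + math.log10(count))


def idf(term):
    """log((1 + |d|) / |d_t|); assumes the term occurs in at least one document."""
    column = terms.index(term)
    containing = sum(1 for row in freq.values() if row[column] > 0)
    return math.log10((1 + len(freq)) / containing)


def tf_idf(doc, term):
    return tf(freq[doc][terms.index(term)]) * idf(term)


print(round(tf(15), 4))               # TF(d4, t6)
print(round(idf("t6"), 3))            # IDF(t6)
print(round(tf_idf("d4", "t6"), 3))   # TF-IDF(d4, t6)
```

Term t6 occurs in three of the five documents (d1, d4, d5), which is where the divisor 3 in the IDF comes from.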
Text Mining Approaches[2]
The approaches differ in the input they work on:
• Keyword-based approach
  Input: a set of keywords or terms in the documents
  - may only discover simple relationships, e.g. “database” & “system”,
    “terrorist” & “explosion”
  - may not bring deep understanding to the text
• Tagging approach
  Input: a set of tags
  - may rely on manual tagging (costly and not feasible for large collections
    of documents)
• Information extraction approach
  Input: semantic information (events, facts, etc.)
  - more advanced
  - may lead to the discovery of some deep knowledge
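The keyword-based approach can be sketched as counting which keywords occur together in the same document. The sentences and the keyword set below are hypothetical; the result shows the kind of surface-level pairing the slide mentions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents and keyword set, for illustration only.
documents = [
    "the database system stores records",
    "a distributed database system scales out",
    "the terrorist attack involved an explosion",
    "police linked the explosion to a terrorist cell",
]
keywords = {"database", "system", "terrorist", "explosion"}

# Count every pair of keywords that appears together in one document.
pair_counts = Counter()
for text in documents:
    present = sorted(keywords & set(text.split()))
    pair_counts.update(combinations(present, 2))

for pair, count in pair_counts.most_common():
    print(pair, count)
```

The counts reveal that the words co-occur, but nothing about what the relationship between them means.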
Text Mining Process[4]
1. Text documents are collected from different sources
2. Preprocessing
3. A text mining technique is applied
4. Analysis of text
5. Discovery of knowledge
Technologies such as information extraction, categorization, clustering, visualization, and
summarization are used in the text mining process.
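A toy version of this process is sketched below: preprocessing (lower-casing, tokenization, stop-word removal) followed by a very simple technique (term counting). The stop-word list, the function names, and the two input sentences are assumptions made for the illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "from", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lower-case the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def count_terms(documents):
    """Apply a simple technique (term counting) to the preprocessed documents."""
    counts = Counter()
    for text in documents:
        counts.update(preprocess(text))
    return counts


documents = [
    "Text mining is the analysis of data in natural language text.",
    "Patterns are extracted from natural language text.",
]
term_counts = count_terms(documents)
print(term_counts.most_common(3))
```

The analysis and knowledge discovery steps would then interpret such counts, for example by comparing them across sources or over time.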
Techniques Used in Text Mining[4]
1. Information extraction
includes tokenization, identification of named entities, sentence segmentation, and
part-of-speech assignment.
2. Text categorization
the procedure of assigning a category to a text from among categories predefined by users.
3. Text clustering
the procedure of segmenting texts into several clusters, depending on their substantial
relevance.
4. Visualization
improves and simplifies the discovery of relevant information.
5. Text summarization
the procedure of automatically extracting partial content of a text that reflects its whole content.
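Text categorization, for instance, can be sketched as picking the predefined category whose keywords overlap most with the text. This keyword-overlap rule is a deliberately naive stand-in, not the method of [4], and the categories and keywords are hypothetical.

```python
# Hypothetical user-defined categories with a few indicative keywords each.
CATEGORY_KEYWORDS = {
    "sports": {"cricket", "runs", "wickets", "match"},
    "technology": {"database", "software", "python", "mining"},
}


def categorize(text):
    """Return the category sharing the most keywords with the text."""
    tokens = set(text.lower().split())
    scores = {
        category: len(tokens & keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword matched at all: do not force a category.
    return best if scores[best] > 0 else "unknown"


print(categorize("Tendulkar made 100 runs in the match"))
print(categorize("Text mining with Python"))
print(categorize("Hello world"))
```

Practical categorizers usually learn from labelled examples instead of relying on hand-written keyword lists.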
Merits and Demerits of Text Mining[4]
Merits:
i) The names of different entities and the relationships between them can easily be
found from a corpus of documents (using techniques such as
information extraction).
ii) The challenging problem of managing large amounts of unstructured
information in order to extract patterns is solved by text mining.
Demerits:
i) The information that is initially needed is nowhere written.
ii) No program can analyze unstructured text directly in order to mine it for
information or knowledge.
Challenges in Text Mining
(Representation issues)
• Each word has a dictionary meaning, or meanings:
Run: (1) the verb; (2) the noun, in cricket
Cricket: (1) the game; (2) the insect
Apple: (1) the company; (2) the fruit
• Ambiguity and context sensitivity: each word is used in various “senses”:
Tendulkar made 100 runs.
Because of an injury, Tendulkar cannot run and will need a runner between the
wickets.
• Capturing the “meaning” of sentences is an important issue as well.
(Grammar, parts of speech, and time sense could be easy!)
• The order of words in the query matters:
hot dog stand in the amusement park
hot amusement stand in the dog park
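The word-order issue can be made concrete with a small Python sketch: a bag-of-words representation cannot tell the two queries apart, while a representation that keeps adjacent word pairs (bigrams) can. The helper name `bigrams` is illustrative.

```python
from collections import Counter

query_a = "hot dog stand in the amusement park"
query_b = "hot amusement stand in the dog park"

# Bag of words: only word counts are kept, so the order is lost.
bag_a = Counter(query_a.split())
bag_b = Counter(query_b.split())
print(bag_a == bag_b)


def bigrams(text):
    """Return the set of adjacent word pairs, which preserves local order."""
    words = text.split()
    return set(zip(words, words[1:]))


print(bigrams(query_a) == bigrams(query_b))
print(sorted(bigrams(query_a) - bigrams(query_b)))
```

The two bags are identical, whereas pairs such as ("hot", "dog") appear only in the first query.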
Text Mining Applications[5]
1. Security applications
(monitoring and analysis of online plain-text sources such as Internet news, blogs, etc.,
for national security purposes)
2. Biomedical applications
(studies in protein docking, protein interactions, and protein-disease associations)
3. Software applications
(within the public sector, much effort has been concentrated on creating software for
tracking and monitoring terrorist activities)
4. Online media applications
(the Tribune Company uses text mining to clarify information and to provide readers
with greater search experiences, which in turn increases site "stickiness" and revenue)
5. Business and marketing applications
(CRM, improving predictive analytics models for customers, stock returns prediction)
6. Sentiment analysis
(analysis of movie reviews, detection of emotions, etc.)
7. Scientific literature mining and academic applications
Text Mining Computer Programs[5]
Demo
• Text Mining using Python
(Twitter, WhatsApp chats)
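The live demo itself is not part of this text. As a stand-in, the sketch below shows the kind of summary such a demo might compute on chat data; the messages, the sender names, and the choice of statistics are all hypothetical and do not reproduce the presenter's code.

```python
from collections import Counter

# Hypothetical chat messages as (sender, text) pairs; not real exported data.
messages = [
    ("Asha", "Are we meeting for the text mining demo today?"),
    ("Ravi", "Yes, the demo is at 3 pm"),
    ("Asha", "Great, I will bring the slides for the demo"),
]

# Who writes the most messages?
messages_per_sender = Counter(sender for sender, _ in messages)

# Which words are used most often (punctuation stripped, case ignored)?
word_counts = Counter(
    word.strip("?,.!").lower()
    for _, text in messages
    for word in text.split()
)

print(messages_per_sender.most_common())
print(word_counts["demo"])
```

Real Twitter or WhatsApp data would first need to be parsed from its export format, which this sketch does not attempt.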
Latest Research work on Text Mining[6]
1. Sunil Kumar, Maninder Singh, “Big data analytics for healthcare industry: impact,
applications, and tools”, DOI: 10.26599/BDMA.2018.9020031
2. Bing Li, Xiaochun Yang, Rui Zhou, Bin Wang, Chengfei Liu, Yanchun Zhang, “An
Efficient Method for High Quality and Cohesive Topical Phrase Mining”, DOI:
10.1109/TKDE.2018.2823758
3. Steven H. H. Ding, Benjamin C. M. Fung, Farkhund Iqbal, William K. Cheung,
“Learning Stylometric Representations for Authorship Analysis”, DOI:
10.1109/TCYB.2017.2766189
4. Mohammed Nasri, Younes Jaafar, Karim Bouzoubaa, “Semantic Analysis of Arabic
Texts Within SAFAR Framework”, DOI: 10.1109/CIST.2018.8596491
5. Jayesh Choudhari, Anirban Dasgupta, Indrajit Bhattacharya, Srikanta Bedathur,
“Discovering Topical Interactions in Text-Based Cascades Using Hidden Markov
Hawkes Processes”, DOI: 10.1109/ICDM.2018.00112
6. Yong Luo, Huaizheng Zhang, Yongjie Wang, Yonggang Wen, Xinwen Zhang,
“ResumeNet: A Learning-Based Framework for Automatic Resume Quality
Assessment”, DOI: 10.1109/ICDM.2018.00046
7. Si-Yu Ding, Xu-Ying Liu, Min-Ling Zhang, “Imbalanced Augmented Class Learning with
Unlabeled Data by Label Confidence Propagation”, DOI: 10.1109/ICDM.2018.00023
References
[1] Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, 6th
edition.
[2] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd
edition.
[3] http://people.ischool.berkeley.edu/~hearst/text-mining.html
[4] Sonali Vijay Gaikwad, Archana Chaugule, Pramod Patil, “Text Mining Methods and
Techniques”, International Journal of Computer Applications (0975-8887),
Volume 85, No. 17, January 2014.
[5] http://www.wikipedia.org
[6] https://ieeexplore.org
Questions?
Thanks!
Data Warehouse
[Figure: data sources in Delhi, Mumbai, Kolkata, and Chennai are cleaned, integrated,
transformed, loaded, and refreshed into a data warehouse, which clients access through
query and analysis tools]
