Text Mining
By
SATHISHKUMAR G
(sathishsak111@gmail.com)
What is Text Mining?
• There are many examples of text-based documents (all in
‘electronic’ format…)
– e-mails, corporate Web pages, customer surveys, résumés, medical records,
DNA sequences, technical papers, incident reports, news stories and
more…
• Not enough time or patience to read
– Can we extract the most vital kernels of information…
• So, we wish to find a way to gain knowledge (in summarised form)
from all that text, without reading or examining them fully first…!
– Some others (e.g. DNA seq.) are hard to comprehend!
What is Text Mining?
• Traditional data mining uses ‘structured data’ (n x p
matrix)
• ‘Free-form text’, by contrast, is referred to as
‘unstructured data’
– successful categorisation of such data can be a difficult and
time-consuming task…
• Often, can combine free-form text and structured data to
derive valuable, actionable information… (e.g. as in
typical surveys) – semi-structured
Text Mining: Examples
• Text mining is an exercise to gain knowledge from stores
of language text.
• Text:
– Web pages
– Medical records
– Customer surveys
– Email filtering (spam)
– DNA sequences
– Incident reports
– Drug interaction reports
– News stories (e.g. predict stock movement)
What is Text Mining
• Data examples
– Web pages
– Customer surveys
Customer | Age | Sex | Tenure   | Comments                                   | Outcome
123      | 24  | M   | 12 years | Incorrect charges on bill; customer angry  | Y
243      | 26  | F   | 1 month  | Inquiry about charges to India             | N
346      | 54  | M   | 3 years  | Question about charges on bill             | N
Amazon.com
Of Mice and Men: Concordance
Concordance is an alphabetized list of the most frequently occurring words in a book,
excluding common words such as "of" and "it." The font size of a word is proportional to the
number of times it occurs in the book.
Of Mice and Men: Text Stats
Text Mining: Yahoo Buzz
Text Mining: Google News
Text Mining
• Typically falls into one of two categories
– Analysis of text: I have a bunch of text I am
interested in, tell me something about it
• E.g. sentiment analysis, “buzz” searches
– Retrieval: There is a large corpus of text documents,
and I want the one closest to a specified query
• E.g. web search, library catalogs, legal and medical
precedent studies
Text Mining: Analysis
• Which words are most present
• Which words are most surprising
• Which words help define the document
• What are the interesting text phrases?
Text Mining: Retrieval
• Find k objects in the corpus of documents which
are most similar to my query.
• Can be viewed as “interactive” data mining -
query not specified a priori.
• Main problems of text retrieval:
– What does “similar” mean?
– How do I know if I have the right documents?
– How can I incorporate user feedback?
Text Retrieval: Challenges
• Calculating similarity is not obvious - what is the distance between
two sentences or queries?
• Evaluating retrieval is hard: what is the “right” answer ? (no
ground truth)
• User can query things you have not seen before e.g. misspelled,
foreign, new terms.
• Goal (score function) is different than in classification/regression:
not looking to model all of the data, just get best results for a
given user.
• Words can hide semantic content
– Synonymy: A keyword T does not appear anywhere in the document, even
though the document is closely related to T, e.g., data mining
– Polysemy: The same keyword may mean different things in different
contexts, e.g., mining
Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
• Recall: the percentage of documents that are relevant to the
query and were, in fact, retrieved
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Precision vs. Recall
• We’ve been here before!
– Precision = TP/(TP+FP)
– Recall = TP/(TP+FN)
– Trade off:
• If algorithm is ‘picky’: precision high, recall low
• If algorithm is ‘relaxed’: precision low, recall high
– BUT: recall often hard if not impossible to calculate
Truth: Relevant Truth: Not Relevant
Algorithm: Relevant TP FP
Algorithm: Not Relevant FN TN
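The two measures can be computed directly from the confusion-matrix counts; a minimal sketch in Python, using hypothetical counts for illustration:

```python
def precision_recall(tp, fp, fn):
    # Precision = TP/(TP+FP); Recall = TP/(TP+FN)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical retrieval run: 30 relevant docs retrieved,
# 10 irrelevant docs retrieved, 20 relevant docs missed.
p, r = precision_recall(tp=30, fp=10, fn=20)
print(p, r)  # 0.75 0.6
```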
Precision Recall Curves
• If we have a labelled training set, we can calculate recall.
• For any given number of returned documents, we can plot a point
for precision vs. recall. (similar to thresholds in ROC curves)
• Different retrieval algorithms might have very different curves -
hard to tell which is “best”
Term / document matrix
• Most common form of representation in text
mining is the term - document matrix
– Term: typically a single word, but could be a word
phrase like “data mining”
– Document: a generic term meaning a collection of
text to be retrieved
– Can be large - vocabularies often exceed 50k terms, and
document collections can reach the billions (www).
– Can be binary, or use counts
Term document matrix
• Each document now is just a vector of terms, sometimes
boolean
Database SQL Index Regression Likelihood linear
D1 24 21 9 0 0 3
D2 32 10 5 0 3 0
D3 12 16 5 0 0 0
D4 6 7 2 0 0 0
D5 43 31 20 0 3 0
D6 2 0 0 18 7 6
D7 0 0 1 32 12 0
D8 3 0 0 22 4 4
D9 1 0 0 34 27 25
D10 6 0 0 17 4 23
Example: 10 documents: 6 terms
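A minimal sketch of building such a count matrix in Python; the documents and term list below are made up for illustration:

```python
from collections import Counter

def term_document_matrix(docs, terms):
    # One row per document, one column per term, entries = raw counts.
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[t] for t in terms])
    return rows

terms = ["database", "sql", "regression"]
docs = ["SQL database SQL", "regression likelihood regression"]
print(term_document_matrix(docs, terms))  # [[1, 2, 0], [0, 0, 2]]
```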
Term document matrix
• We have lost all semantic content
• Be careful constructing your term list!
– Not all words are created equal!
– Words that are the same should be treated the same!
• Stop Words
• Stemming
Stop words
• Many of the most frequently used words in English are worthless in
retrieval and text mining – these words are called stop words.
– the, of, and, to, ….
– Typically about 400 to 500 such words
– For an application, an additional domain specific stop words list may be
constructed
• Why do we need to remove stop words?
– Reduce indexing (or data) file size
• stop words account for 20-30% of total word counts.
– Improve efficiency
• stop words are not useful for searching or text mining
• stop words always have a large number of hits
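Stop-word removal is a simple filter; a sketch with a tiny illustrative stop list (real lists run to 400-500 words):

```python
# Tiny illustrative stop list - real lists run to 400-500 words.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the analysis of free form text".split()
print(remove_stop_words(tokens))  # ['analysis', 'free', 'form', 'text']
```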
Stemming
• Techniques used to find out the root/stem of a word:
– E.g.,
– user, users, used, using → stem: use
– engineering, engineered, engineer → stem: engineer
Usefulness
• improving effectiveness of retrieval and text mining
– matching similar words
• reducing indexing size
– combining words with the same roots may reduce indexing size as
much as 40-50%.
Basic stemming methods
• remove ending
– if a word ends with a consonant other than s,
followed by an s, then delete s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only of
one letter or of th.
– If a word ends with ed, preceded by a consonant, delete the ed unless this
leaves only a single letter.
– …...
• transform words
– if a word ends with “ies” but not “eies” or “aies” then “ies --> y.”
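The suffix rules above can be sketched as a toy stemmer; this implements only the listed rules (a real system would use something like the Porter stemmer, which also handles transforms such as "using" → "use"):

```python
VOWELS = "aeiou"

def simple_stem(word):
    # Toy stemmer implementing only the suffix rules listed above.
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"          # studies -> study
    if word.endswith("ing") and len(word) - 3 > 1 and word[:-3] != "th":
        return word[:-3]                # engineering -> engineer
    if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]                # engineered -> engineer
    if word.endswith("es"):
        return word[:-1]                # drop the s
    if word.endswith("s") and len(word) > 1 and word[-2] not in VOWELS + "s":
        return word[:-1]                # users -> user
    return word

print([simple_stem(w) for w in ["users", "engineering", "studies"]])
# ['user', 'engineer', 'study']
```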
Feature Selection
• Performance of text classification algorithms can be optimized by
selecting only a subset of the discriminative terms
– Even after stemming and stopword removal.
• Greedy search
– Start from full set and delete one at a time
– Find the least important variable
• Can use Gini index for this if a classification problem
• Often performance does not degrade even with orders of
magnitude reductions
– Chakrabarti, Chapter 5: Patent data: 9600 patents in communication,
electricity and electronics.
– Only 140 out of 20,000 terms needed for classification!
Distances in TD matrices
• Given a term-document matrix representation, we can now define
distances between documents (or terms!)
• Elements of matrix can be 0,1 or term frequencies (sometimes
normalized)
• Can use Euclidean or cosine distance
• Cosine distance is based on the angle between the two vectors:
dc(x, y) = (x · y) / (‖x‖ ‖y‖)
• Not intuitive, but has been proven to work well
• If docs are identical, dc = 1; if they have nothing in common, dc = 0
• We can calculate cosine and Euclidean distance
for this matrix
• What would you want the distances to look like?
Database SQL Index Regression Likelihood linear
D1 24 21 9 0 0 3
D2 32 10 5 0 3 0
D3 12 16 5 0 0 0
D4 6 7 2 0 0 0
D5 43 31 20 0 3 0
D6 2 0 0 18 7 6
D7 0 0 1 32 12 0
D8 3 0 0 22 4 4
D9 1 0 0 34 27 25
D10 6 0 0 17 4 23
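A minimal cosine calculation over two rows of this matrix (D1 and D6), sketched in Python:

```python
import math

def cosine_similarity(x, y):
    # Cosine of the angle between two term vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = [24, 21, 9, 0, 0, 3]   # a "database" document
d6 = [2, 0, 0, 18, 7, 6]    # a "regression" document
print(cosine_similarity(d1, d1))  # ~1.0 (identical docs)
print(cosine_similarity(d1, d6))  # near 0 (little overlap)
```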
Document distance
• Pairwise distances between documents
• Image plots of cosine distance, Euclidean, and
scaled Euclidean
R function: ‘image’
Weighting in TD space
• Not all phrases are of equal importance
– E.g. David less important than Beckham
– If a term occurs frequently in many documents it has less discriminatory
power
– One way to correct for this is inverse-document frequency (IDF):
IDF = log(N / Nj)
– Nj = # of docs containing the term
– N = total # of docs
– A term is “important” if it has a high TF and/or a high IDF.
– TF x IDF is a common measure of term importance
Database SQL Index Regression Likelihood linear
D1 24 21 9 0 0 3
D2 32 10 5 0 3 0
D3 12 16 5 0 0 0
D4 6 7 2 0 0 0
D5 43 31 20 0 3 0
D6 2 0 0 18 7 6
D7 0 0 1 32 12 0
D8 3 0 0 22 4 4
D9 1 0 0 34 27 25
D10 6 0 0 17 4 23
Database SQL Index Regression Likelihood linear
D1 2.53 14.6 4.6 0 0 2.1
D2 3.3 6.7 2.6 0 1.0 0
D3 1.3 11.1 2.6 0 0 0
D4 0.7 4.9 1.0 0 0 0
D5 4.5 21.5 10.2 0 1.0 0
D6 0.2 0 0 12.5 2.5 11.1
D7 0 0 0.5 22.2 4.3 0
D8 0.3 0 0 15.2 1.4 1.4
D9 0.1 0 0 23.56 9.6 17.3
D10 0.6 0 0 11.8 1.4 16.0
TF x IDF weighted version of the matrix above
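TF x IDF weighting can be sketched as follows; note the exact IDF variant and log base used for the slide's numbers are not stated, so natural log is assumed here:

```python
import math

def tf_idf(matrix):
    # w_ij = tf_ij * log(N / n_j), where n_j = # docs containing term j.
    # (The slide's numbers may use a different log base or IDF variant.)
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * (math.log(n_docs / df[j]) if df[j] else 0.0)
             for j in range(n_terms)]
            for row in matrix]

counts = [[2, 0],
          [1, 1]]
# A term appearing in every document gets IDF = log(1) = 0.
print(tf_idf(counts))
```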
Queries
• A query is a representation of the user’s information needs
– Normally a list of words.
• Once we have a TD matrix, queries can be represented as a vector in
the same space
– “Database Index” = (1,0,1,0,0,0)
• Query can be a simple question in natural language
• Calculate cosine distance between query and the TF x IDF version of
the TD matrix
• Returns a ranked vector of documents
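The query step can be sketched end-to-end; for brevity this ranks against raw counts from the toy TD matrix rather than the TF x IDF weights:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Three rows of the toy TD matrix
# (terms: Database, SQL, Index, Regression, Likelihood, linear)
td = {
    "D1": [24, 21, 9, 0, 0, 3],
    "D6": [2, 0, 0, 18, 7, 6],
    "D9": [1, 0, 0, 34, 27, 25],
}
query = [1, 0, 1, 0, 0, 0]  # "Database Index"
ranked = sorted(td, key=lambda d: cosine(query, td[d]), reverse=True)
print(ranked)  # the database-heavy document D1 ranks first
```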
Latent Semantic Indexing
• Criticism: queries can be posed in many ways, but still
mean the same
– Data mining and knowledge discovery
– Car and automobile
– Beet and beetroot
• Semantically, these are the same, and documents with
either term are relevant.
• Synonym lists or thesauri are one solution, but messy
and difficult to maintain.
• Latent Semantic Indexing (LSI): tries to extract hidden
semantic structure in the documents
• Search what I meant, not what I said!
LSI
• Approximate the T-dimensional term space using
principal components calculated from the TD matrix
• The first k PC directions provide the best set of k
orthogonal basis vectors - these explain the most
variance in the data.
– Data is reduced to an N x k matrix, without much loss of
information
• Each “direction” is a linear combination of the input
terms, and defines a clustering of “topics” in the data.
• What does this mean for our toy example?
Database SQL Index Regression Likelihood linear
D1 24 21 9 0 0 3
D2 32 10 5 0 3 0
D3 12 16 5 0 0 0
D4 6 7 2 0 0 0
D5 43 31 20 0 3 0
D6 2 0 0 18 7 6
D7 0 0 1 32 12 0
D8 3 0 0 22 4 4
D9 1 0 0 34 27 25
D10 6 0 0 17 4 23
Database SQL Index Regression Likelihood linear
D1 2.53 14.6 4.6 0 0 2.1
D2 3.3 6.7 2.6 0 1.0 0
D3 1.3 11.1 2.6 0 0 0
D4 0.7 4.9 1.0 0 0 0
D5 4.5 21.5 10.2 0 1.0 0
D6 0.2 0 0 12.5 2.5 11.1
D7 0 0 0.5 22.2 4.3 0
D8 0.3 0 0 15.2 1.4 1.4
D9 0.1 0 0 23.56 9.6 17.3
D10 0.6 0 0 11.8 1.4 16.0
LSI
• Typically done using Singular Value Decomposition
(SVD) to find principal components
The TD matrix (term weighting by document, 10 x 6) factors into
document coordinates, a diagonal matrix of singular values, and a
new orthogonal basis for the data (the PC directions)
For our example: S = (77.4, 69.5, 22.9, 13.5, 12.1, 4.8)
Fraction of the variance explained (PC1&2) = (77.4² + 69.5²) / Σ si² = 92.5%
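The SVD step can be sketched with NumPy on the toy TD matrix; note the slide's exact singular values may come from a weighted or preprocessed variant of the matrix, so the numbers here can differ slightly:

```python
import numpy as np

# The 10 x 6 TD matrix from the earlier slides.
td = np.array([
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0],
    [43, 31, 20, 0, 3, 0],
    [2, 0, 0, 18, 7, 6],
    [0, 0, 1, 32, 12, 0],
    [3, 0, 0, 22, 4, 4],
    [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
], dtype=float)

U, s, Vt = np.linalg.svd(td, full_matrices=False)
frac = (s[:2] ** 2).sum() / (s ** 2).sum()  # variance explained by top 2 PCs
docs_2d = U[:, :2] * s[:2]                  # documents in 2-D pseudo-term space
print(s.round(1))
print(round(frac, 3))  # most of the variance sits in the first two directions
```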
LSI
Top 2 PC make new
pseudo-terms to define
documents…
Also, can look at first two Principal components:
(0.74,0.49, 0.27,0.28,0.18,0.19) -> emphasizes first two terms
(-0.28,-0.24,-0.12,0.74,0.37,0.31) -> separates the two clusters
Note how distance from the origin shows number of terms,
And angle (from the origin) shows similarity as well
LSI
• Here we show the same
plot, but with two new
documents, one with the
term “SQL” 50 times,
another with the term
“Databases” 50 times.
• Even though they have no
phrases in common, they
are close in LSI space
Textual analysis
• Once we have the data into a nice matrix
representation (TD, TDxIDF, or LSI), we can
throw the data mining toolbox at it:
– Classification of documents
• If we have training data for classes
– Clustering of documents
• unsupervised
Automatic document classification
• Motivation
– Automatic classification for the tremendous number of on-line text documents (Web
pages, e-mails, etc.)
– Customer comments: Requests for info, complaints, inquiries
• A classification problem
– Training set: Human experts generate a training data set
– Classification: The computer system discovers the classification rules
– Application: The discovered rules can be applied to classify new/unknown documents
• Techniques
– Linear/logistic regression, naïve Bayes
– Trees not so good here due to massive dimension, few interactions
Naïve Bayes Classifier for Text
• Naïve Bayes classifier = conditional independence model
– Also called “multivariate Bernoulli”
– Assumes conditional independence assumption given the class:
p( x | ck ) = Π p( xj | ck )
– Note that we model each term xj as a discrete random variable
In other words, the probability that a bunch of words comes from a given class
equals the product of the individual probabilities of those words.
p(ck | x) ∝ p(ck) p(Nx | ck) Π j=1..n p(xj | ck)^xj
Multinomial Classifier for Text
• Multinomial Classification model
– Assumes that the data are generated by a p-sided die (multinomial model)
– where Nx = number of terms (total count) in document x
– xj = number of times term j occurs in the document
– ck = class = k
– Based on training data, each class has its own multinomial probability across all
words.
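A minimal multinomial classifier can be sketched as below; the corpus and class names are invented for illustration, and the document-length term p(Nx | ck) is omitted, as is common in practice:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    # Per-class prior and Laplace-smoothed multinomial word distribution.
    vocab = {w for d in docs for w in d.split()}
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(d)
    priors, word_probs = {}, {}
    for c, ds in by_class.items():
        priors[c] = len(ds) / len(docs)
        counts = Counter(w for d in ds for w in d.split())
        total = sum(counts.values())
        word_probs[c] = {w: (counts[w] + 1) / (total + len(vocab))
                         for w in vocab}
    return priors, word_probs

def classify(doc, priors, word_probs):
    # Score each class by log prior + sum of log word probabilities.
    best, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c]) + sum(
            math.log(word_probs[c][w]) for w in doc.split()
            if w in word_probs[c])
        if score > best_score:
            best, best_score = c, score
    return best

docs = ["database sql index", "sql database query",
        "regression likelihood linear", "linear regression model"]
labels = ["db", "db", "stats", "stats"]
priors, wp = train_multinomial_nb(docs, labels)
print(classify("sql index", priors, wp))  # db
```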
Naïve Bayes vs. Multinomial
• Many extensions and adaptations of both
• Text mining classification models usually a version of one of these
• Example: Web pages
– Classify webpages from CS departments into:
• student, faculty, course,project
– Train on ~5,000 hand-labeled web pages from Cornell, Washington,
U.Texas, Wisconsin
– Crawl and classify a new site (CMU)
Student Faculty Person Project Course Departmt
Extracted 180 66 246 99 28 1
Correct 130 28 194 72 25 1
Accuracy: 72% 42% 79% 73% 89% 100%
NB vs. multinomial
Highest Probability Terms in Multinomial Distributions
Classifying web pages at a University:
Document Clustering
• Can also do clustering, or unsupervised learning of docs.
• Automatically group related documents based on their
content.
• Require no training sets or predetermined taxonomies.
• Major steps
– Preprocessing
• Remove stop words, stem, feature extraction, lexical analysis, …
– Hierarchical clustering
• Compute similarities applying clustering algorithms, …
– Slicing
• Fan out controls, flatten the tree to desired number of levels.
• Like all clustering examples, success is relative
Document Clustering
• To Cluster:
– Can use LSI
– Another model: Latent Dirichlet Allocation (LDA)
– LDA is a generative probabilistic model of a corpus. Documents are
represented as random mixtures over latent topics, where a topic is
characterized by a distribution over words.
• LDA:
– Three concepts: words, topics, and documents
– Documents are a collection of words and have a probability
distribution over topics
– Topics have a probability distribution over words
– Fully Bayesian Model
LDA
• Assume data was generated by a generative process:
• θ is a document - made up from topics from a probability distribution
• z is a topic made up from words from a probability distribution
• w is a word, the only real observables (N=number of words in all documents)
• Then, the LDA equations are specified in a fully Bayesian model:
α = per-document topic distributions
These can be solved via advanced
computational techniques;
see Blei, et al 2003
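Fitting LDA can be sketched with scikit-learn's implementation (which uses variational inference rather than the exact posterior); the four-document count matrix below is a toy subset of the earlier TD matrix:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus as term counts (terms: Database, SQL, Index,
# Regression, Likelihood, linear) - two obvious topics.
td = np.array([
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [2, 0, 0, 18, 7, 6],
    [1, 0, 0, 34, 27, 25],
])
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(td)   # each row: distribution over topics
# Normalise components_ to get each topic's distribution over words.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(doc_topic.shape)   # (4, 2)
```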
LDA output
• The result can be an often-useful classification of documents into topics, and a
distribution of each topic across words:
Another Look at LDA
• Model: Topics made up of words used to generate documents
Another Look at LDA
• Reality: Documents observed, infer topics
Case Study: TV Listings
• Use text to make recommendations for TV shows
Data Issues
• 10013|In Harm's Way|In Harm's Way|A tough Naval officer faces the enemy
while fighting in the South Pacific during World War II.|A tough Naval
officer faces the enemy while fighting in the South Pacific during World
War II.|en-US| Movie,NR Rating|Movies:Drama|||165|1965|USA||||||STARS-3||
NR|John Wayne, Kirk Douglas, Patricia Neal, Tom Tryon, Paula Prentis s,
Burgess Meredith|Otto Preminger||||Otto Preminger|
Parsed Program Guide entries – 2 weeks, ~66,000 programs, 19,000 words
• Collapse on series (syndicated shows are still a problem)
• Stopwords/stemming, duplication, paid programming, length
normalization
Data Processing
• Combine shows from one series into a ‘canonical’ format
Results
• We fit LDA
– Results in a full distribution of words, topics and documents
– Topics are unveiled which are a collection of words
Results
• For user modelling, consider the collection of shows a single user
watches as a ‘document’ – then look to see what topics (and hence,
words) make up that document
Show mining via text
Text Mining: Helpful Data
• WordNet
Courtesy: Luca Lanzi
Text Mining - Other Topics
• Part of Speech Tagging
– Assign grammatical tags to words (verb, noun, etc)
– Helps in understanding documents : uses Hidden Markov Models
• Named Entity Classification
– Classification task: can we automatically detect proper nouns and tag them
– “Mr. Jones” is a person; “Madison” is a town.
– Helps with dis-ambiguation: spears
Text Mining - Other Topics
• Sentiment Analysis
– Automatically determine tone in text: positive, negative or neutral
– Typically uses collections of good and bad words
– “While the traditional media is slowly starting to take John McCain’s straight talking
image with increasingly large grains of salt, his base isn’t quite ready to give up on their
favorite son. Jonathan Alter’s bizarre defense of McCain after he was caught telling an
outright lie, perfectly captures that reluctance[.]”
– Often fit using Naïve Bayes
• There are sentiment word lists out there:
– See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
Text Mining - Other Topics
• Summarizing text: Word Clouds
– Takes text as input, finds the most
interesting ones, and displays them
graphically
– Blogs do this
– Wordle.net
Modest Mouse lyrics
Thank you
Electrical Activity of the Heart
 
Software process life cycles
Software process life cyclesSoftware process life cycles
Software process life cycles
 
Digital Logic Circuits
Digital Logic CircuitsDigital Logic Circuits
Digital Logic Circuits
 
Real-Time Scheduling
Real-Time SchedulingReal-Time Scheduling
Real-Time Scheduling
 
Real-Time Signal Processing: Implementation and Application
Real-Time Signal Processing:  Implementation and ApplicationReal-Time Signal Processing:  Implementation and Application
Real-Time Signal Processing: Implementation and Application
 
DIGITAL SIGNAL PROCESSOR OVERVIEW
DIGITAL SIGNAL PROCESSOR OVERVIEWDIGITAL SIGNAL PROCESSOR OVERVIEW
DIGITAL SIGNAL PROCESSOR OVERVIEW
 
FRACTAL ROBOTICS
FRACTAL  ROBOTICSFRACTAL  ROBOTICS
FRACTAL ROBOTICS
 
Electro bike
Electro bikeElectro bike
Electro bike
 
ROBOTIC SURGERY
ROBOTIC SURGERYROBOTIC SURGERY
ROBOTIC SURGERY
 
POWER GENERATION OF THERMAL POWER PLANT
POWER GENERATION OF THERMAL POWER PLANTPOWER GENERATION OF THERMAL POWER PLANT
POWER GENERATION OF THERMAL POWER PLANT
 
mathematics application fiels of engineering
mathematics application fiels of engineeringmathematics application fiels of engineering
mathematics application fiels of engineering
 
Plastics…
Plastics…Plastics…
Plastics…
 
ENGINEERING
ENGINEERINGENGINEERING
ENGINEERING
 
ENVIRONMENTAL POLLUTION
ENVIRONMENTALPOLLUTIONENVIRONMENTALPOLLUTION
ENVIRONMENTAL POLLUTION
 
RFID TECHNOLOGY
RFID TECHNOLOGYRFID TECHNOLOGY
RFID TECHNOLOGY
 
green chemistry
green chemistrygreen chemistry
green chemistry
 
NANOTECHNOLOGY
  NANOTECHNOLOGY	  NANOTECHNOLOGY
NANOTECHNOLOGY
 

Recently uploaded

Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
sandeepmenon62
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
kalichargn70th171
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
Alberto Brandolini
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
Patrick Weigel
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision
ShulagnaSarkar2
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISDECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
Tier1 app
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
kgyxske
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
Reetu63
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
ervikas4
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
kalichargn70th171
 
The Comprehensive Guide to Validating Audio-Visual Performances.pdf
The Comprehensive Guide to Validating Audio-Visual Performances.pdfThe Comprehensive Guide to Validating Audio-Visual Performances.pdf
The Comprehensive Guide to Validating Audio-Visual Performances.pdf
kalichargn70th171
 
42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert
vaishalijagtap12
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid
 
Manyata Tech Park Bangalore_ Infrastructure, Facilities and More
Manyata Tech Park Bangalore_ Infrastructure, Facilities and MoreManyata Tech Park Bangalore_ Infrastructure, Facilities and More
Manyata Tech Park Bangalore_ Infrastructure, Facilities and More
narinav14
 
美洲杯赔率投注网【​网址​🎉3977·EE​🎉】
美洲杯赔率投注网【​网址​🎉3977·EE​🎉】美洲杯赔率投注网【​网址​🎉3977·EE​🎉】
美洲杯赔率投注网【​网址​🎉3977·EE​🎉】
widenerjobeyrl638
 
Orca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container OrchestrationOrca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container Orchestration
Pedro J. Molina
 

Recently uploaded (20)

Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISDECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
 
The Comprehensive Guide to Validating Audio-Visual Performances.pdf
The Comprehensive Guide to Validating Audio-Visual Performances.pdfThe Comprehensive Guide to Validating Audio-Visual Performances.pdf
The Comprehensive Guide to Validating Audio-Visual Performances.pdf
 
42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
 
Manyata Tech Park Bangalore_ Infrastructure, Facilities and More
Manyata Tech Park Bangalore_ Infrastructure, Facilities and MoreManyata Tech Park Bangalore_ Infrastructure, Facilities and More
Manyata Tech Park Bangalore_ Infrastructure, Facilities and More
 
美洲杯赔率投注网【​网址​🎉3977·EE​🎉】
美洲杯赔率投注网【​网址​🎉3977·EE​🎉】美洲杯赔率投注网【​网址​🎉3977·EE​🎉】
美洲杯赔率投注网【​网址​🎉3977·EE​🎉】
 
Orca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container OrchestrationOrca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container Orchestration
 

Text Mining

  • 2. What is Text Mining? • There are many examples of text-based documents (all in ‘electronic’ format…) – e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more… • Not enough time or patience to read – Can we extract the most vital kernels of information… • So, we wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining them fully first…! – Some others (e.g. DNA seq.) are hard to comprehend!
  • 3. What is Text Mining? • Traditional data mining uses ‘structured data’ (n x p matrix) • The analysis of ‘free-form text’ is also referred to as ‘unstructured data’, – successful categorisation of such data can be a difficult and time-consuming task… • Often, can combine free-form text and structured data to derive valuable, actionable information… (e.g. as in typical surveys) – semi-structured
  • 4. Text Mining: Examples • Text mining is an exercise to gain knowledge from stores of language text. • Text: – Web pages – Medical records – Customer surveys – Email filtering (spam) – DNA sequences – Incident reports – Drug interaction reports – News stories (e.g. predict stock movement)
  • 5. What is Text Mining • Data examples – Web pages – Customer surveys

    Customer  Age  Sex  Tenure    Comments                                    Outcome
    123       24   M    12 years  Incorrect charges on bill; customer angry   Y
    243       26   F    1 month   Inquiry about charges to India              N
    346       54   M    3 years   Question about charges on bill              N
  • 7. Of Mice and Men: Concordance Concordance is an alphabetized list of the most frequently occurring words in a book, excluding common words such as "of" and "it." The font size of a word is proportional to the number of times it occurs in the book.
  • 8. Of Mice and Men: Text Stats
  • 11. Text Mining • Typically falls into one of two categories – Analysis of text: I have a bunch of text I am interested in, tell me something about it • E.g. sentiment analysis, “buzz” searches – Retrieval: There is a large corpus of text documents, and I want the one closest to a specified query • E.g. web search, library catalogs, legal and medical precedent studies
  • 12. Text Mining: Analysis • Which words are most frequent • Which words are most surprising • Which words best characterize the document • What are the interesting text phrases?
  • 13. Text Mining: Retrieval • Find k objects in the corpus of documents which are most similar to my query. • Can be viewed as “interactive” data mining - query not specified a priori. • Main problems of text retrieval: – What does “similar” mean? – How do I know if I have the right documents? – How can I incorporate user feedback?
  • 14. Text Retrieval: Challenges • Calculating similarity is not obvious - what is the distance between two sentences or queries? • Evaluating retrieval is hard: what is the “right” answer ? (no ground truth) • User can query things you have not seen before e.g. misspelled, foreign, new terms. • Goal (score function) is different than in classification/regression: not looking to model all of the data, just get best results for a given user. • Words can hide semantic content – Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining – Polysemy: The same keyword may mean different things in different contexts, e.g., mining
  • 15. Basic Measures for Text Retrieval • Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses) • Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved • precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}| • recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
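A minimal sketch of these two measures, treating the relevance judgments and the retrieval results as sets of document ids (the function name and example ids are illustrative):

```python
def precision_recall(relevant, retrieved):
    """Compute (precision, recall) from two sets of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)          # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 3 retrieved docs are relevant; 2 of the 4 relevant docs were found.
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={2, 3, 5})
```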
  • 16. Precision vs. Recall • We’ve been here before! – Precision = TP/(TP+FP) – Recall = TP/(TP+FN) – Trade off: • If algorithm is ‘picky’: precision high, recall low • If algorithm is ‘relaxed’: precision low, recall high – BUT: recall often hard if not impossible to calculate

                               Truth: Relevant   Truth: Not Relevant
    Algorithm: Relevant              TP                  FP
    Algorithm: Not Relevant          FN                  TN
  • 17. Precision Recall Curves • If we have a labelled training set, we can calculate recall. • For any given number of returned documents, we can plot a point for precision vs. recall. (similar to thresholds in ROC curves) • Different retrieval algorithms might have very different curves - hard to tell which is “best”
  • 18. Term / document matrix • Most common form of representation in text mining is the term - document matrix – Term: typically a single word, but could be a word phrase like “data mining” – Document: a generic term meaning a collection of text to be retrieved – Can be large - terms are often 50k or larger, documents can be in the billions (www). – Can be binary, or use counts
  • 19. Term document matrix • Each document now is just a vector of terms, sometimes boolean. Example: 10 documents, 6 terms:

            Database  SQL  Index  Regression  Likelihood  linear
    D1        24      21     9        0           0          3
    D2        32      10     5        0           3          0
    D3        12      16     5        0           0          0
    D4         6       7     2        0           0          0
    D5        43      31    20        0           3          0
    D6         2       0     0       18           7          6
    D7         0       0     1       32          12          0
    D8         3       0     0       22           4          4
    D9         1       0     0       34          27         25
    D10        6       0     0       17           4         23
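A toy construction of such a count matrix from raw strings, assuming simple whitespace tokenization (real systems tokenize far more carefully):

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a term-document count matrix: one row per document,
    one column per vocabulary term (columns in sorted term order)."""
    vocab = sorted({t for doc in docs for t in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(t, 0) for t in vocab])
    return vocab, rows

vocab, m = term_document_matrix(["database sql sql", "regression likelihood"])
```

For boolean (0/1) entries, replace the count with `min(counts.get(t, 0), 1)`.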
  • 20. Term document matrix • We have lost all semantic content • Be careful constructing your term list! – Not all words are created equal! – Words that are the same should be treated the same! • Stop Words • Stemming
  • 21. Stop words • Many of the most frequently used words in English are worthless in retrieval and text mining – these words are called stop words. – the, of, and, to, …. – Typically about 400 to 500 such words – For an application, an additional domain specific stop words list may be constructed • Why do we need to remove stop words? – Reduce indexing (or data) file size • stopwords accounts 20-30% of total word counts. – Improve efficiency • stop words are not useful for searching or text mining • stop words always have a large number of hits
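A sketch of stop-word filtering; the list here is a tiny illustrative sample, not one of the 400-500-word lists the slide mentions:

```python
# Tiny illustrative stop-word list; real lists have 400-500 entries,
# often extended with domain-specific terms.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words from a token list before indexing or mining."""
    return [t for t in tokens if t.lower() not in stop_words]
```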
  • 22. Stemming • Techniques used to find the root/stem of a word: – E.g., user, users, used, using → stem: use; engineering, engineered, engineer → stem: engineer • Usefulness: – improving effectiveness of retrieval and text mining (matching similar words) – reducing indexing size (combining words with the same root may reduce indexing size by as much as 40-50%)
  • 23. Basic stemming methods • remove ending – if a word ends with a consonant other than s, followed by an s, then delete s. – if a word ends in es, drop the s. – if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th. – If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter. – …... • transform words – if a word ends with “ies” but not “eies” or “aies” then “ies --> y.”
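The ending-removal rules above can be sketched directly in code. This is only a rough illustration of those rules; real stemmers (e.g. the Porter stemmer) apply many more rules and exceptions:

```python
VOWELS = set("aeiou")

def simple_stem(word):
    """Rough rule-based stemmer implementing the endings listed above."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"                                  # ies -> y
    if w.endswith("es"):
        return w[:-1]                                        # es: drop the s
    if w.endswith("s") and len(w) > 1 and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]                                        # consonant + s: drop s
    if w.endswith("ing") and len(w) - 3 > 1 and w[:-3] != "th":
        return w[:-3]                                        # delete ing
    if w.endswith("ed") and len(w) > 3 and w[-3] not in VOWELS:
        return w[:-2]                                        # consonant + ed: delete ed
    return w
```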
  • 24. Feature Selection • Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms – Even after stemming and stopword removal. • Greedy search – Start from full set and delete one at a time – Find the least important variable • Can use the Gini index for this in a classification problem • Often performance does not degrade even with orders of magnitude reductions – Chakrabarti, Chapter 5: Patent data: 9600 patents in communication, electricity and electronics. – Only 140 out of 20,000 terms needed for classification!
  • 25. Distances in TD matrices • Given a term doc matrix representation, now we can define distances between documents (or terms!) • Elements of matrix can be 0, 1 or term frequencies (sometimes normalized) • Can use Euclidean or cosine distance • Cosine similarity is the cosine of the angle between the two vectors • Not intuitive, but has been proven to work well • If docs are the same, dc = 1; if they have nothing in common, dc = 0
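The cosine measure between two document rows can be sketched in pure Python, assuming equal-length term-count vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors:
    1.0 for identical direction, 0.0 for no terms in common."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note that cosine ignores vector length, so a document repeated twice still has similarity 1.0 with the original, which is usually the desired behavior for text.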
  • 26. • We can calculate cosine and Euclidean distance for this matrix • What would you want the distances to look like?

            Database  SQL  Index  Regression  Likelihood  linear
    D1        24      21     9        0           0          3
    D2        32      10     5        0           3          0
    D3        12      16     5        0           0          0
    D4         6       7     2        0           0          0
    D5        43      31    20        0           3          0
    D6         2       0     0       18           7          6
    D7         0       0     1       32          12          0
    D8         3       0     0       22           4          4
    D9         1       0     0       34          27         25
    D10        6       0     0       17           4         23
  • 27. Document distance • Pairwise distances between documents • Image plots of cosine distance, Euclidean, and scaled Euclidean R function: ‘image’
  • 28. Weighting in TD space • Not all phrases are of equal importance – E.g. David less important than Beckham – If a term occurs frequently in many documents it has less discriminatory power – One way to correct for this is inverse-document frequency (IDF): IDF = log(N / Nj) – Term importance = Term Frequency (TF) x IDF – Nj = # of docs containing the term – N = total # of docs – A term is “important” if it has a high TF and/or a high IDF. – TF x IDF is a common measure of term importance
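One common variant of this weighting, IDF = log(N / Nj), sketched over a count matrix (the slide's exact IDF formula may differ slightly, e.g. in smoothing or log base):

```python
import math

def tf_idf(matrix):
    """Weight a term-document count matrix by TF x IDF, where
    IDF = log(N / n_j): N docs total, n_j docs containing term j."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # document frequency of each term
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
             for j in range(n_terms)]
            for row in matrix]
```

A term appearing in every document gets IDF = log(1) = 0, i.e. no discriminatory power, matching the intuition above.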
  • 29.

    TF (raw counts):
            Database  SQL   Index  Regression  Likelihood  linear
    D1        24       21     9        0           0          3
    D2        32       10     5        0           3          0
    D3        12       16     5        0           0          0
    D4         6        7     2        0           0          0
    D5        43       31    20        0           3          0
    D6         2        0     0       18           7          6
    D7         0        0     1       32          12          0
    D8         3        0     0       22           4          4
    D9         1        0     0       34          27         25
    D10        6        0     0       17           4         23

    TF x IDF:
            Database  SQL   Index  Regression  Likelihood  linear
    D1        2.53    14.6   4.6      0           0          2.1
    D2        3.3      6.7   2.6      0           1.0        0
    D3        1.3     11.1   2.6      0           0          0
    D4        0.7      4.9   1.0      0           0          0
    D5        4.5     21.5  10.2      0           1.0        0
    D6        0.2      0     0       12.5         2.5       11.1
    D7        0        0     0.5     22.2         4.3        0
    D8        0.3      0     0       15.2         1.4        1.4
    D9        0.1      0     0       23.56        9.6       17.3
    D10       0.6      0     0       11.8         1.4       16.0
  • 30. Queries • A query is a representation of the user’s information needs – Normally a list of words. • Once we have a TD matrix, queries can be represented as a vector in the same space – “Database Index” = (1,0,1,0,0,0) • Query can be a simple question in natural language • Calculate cosine distance between query and the TF x IDF version of the TD matrix • Returns a ranked vector of documents
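Ranking documents by cosine similarity to a query vector can be sketched as follows, using the toy matrix's term order (Database, SQL, Index, Regression, Likelihood, linear); the function name is illustrative:

```python
import math

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar
    to the query, by cosine similarity."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
    scores = [(cos(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for score, i in sorted(scores, reverse=True)]

# Query "Database Index" = (1,0,1,0,0,0) against two toy document rows.
ranking = rank_documents([1, 0, 1, 0, 0, 0],
                         [[24, 21, 9, 0, 0, 3], [1, 0, 0, 34, 27, 25]])
```

In practice the document rows would be the TF x IDF matrix rather than raw counts, as the slide notes.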
  • 31. Latent Semantic Indexing • Criticism: queries can be posed in many ways, but still mean the same – Data mining and knowledge discovery – Car and automobile – Beet and beetroot • Semantically, these are the same, and documents with either term are relevant. • Using synonym lists or thesauri are solutions, but messy and difficult. • Latent Semantic Indexing (LSI): tries to extract hidden semantic structure in the documents • Search what I meant, not what I said!
  • 32. LSI • Approximate the T-dimensional term space using principal components calculated from the TD matrix • The first k PC directions provide the best set of k orthogonal basis vectors - these explain the most variance in the data. – Data is reduced to an N x k matrix, without much loss of information • Each “direction” is a linear combination of the input terms, and defines a clustering of “topics” in the data. • What does this mean for our toy example?
  • 33.

    TF (raw counts):
            Database  SQL   Index  Regression  Likelihood  linear
    D1        24       21     9        0           0          3
    D2        32       10     5        0           3          0
    D3        12       16     5        0           0          0
    D4         6        7     2        0           0          0
    D5        43       31    20        0           3          0
    D6         2        0     0       18           7          6
    D7         0        0     1       32          12          0
    D8         3        0     0       22           4          4
    D9         1        0     0       34          27         25
    D10        6        0     0       17           4         23

    TF x IDF:
            Database  SQL   Index  Regression  Likelihood  linear
    D1        2.53    14.6   4.6      0           0          2.1
    D2        3.3      6.7   2.6      0           1.0        0
    D3        1.3     11.1   2.6      0           0          0
    D4        0.7      4.9   1.0      0           0          0
    D5        4.5     21.5  10.2      0           1.0        0
    D6        0.2      0     0       12.5         2.5       11.1
    D7        0        0     0.5     22.2         4.3        0
    D8        0.3      0     0       15.2         1.4        1.4
    D9        0.1      0     0       23.56        9.6       17.3
    D10       0.6      0     0       11.8         1.4       16.0
  • 34. LSI • Typically done using Singular Value Decomposition (SVD) to find principal components of the TD matrix (term weighting by document, 10 x 6): the SVD yields a new orthogonal basis for the data (PC directions) and a diagonal matrix of singular values. For our example: S = (77.4, 69.5, 22.9, 13.5, 12.1, 4.8); fraction of the variance explained by PC1 & PC2 = 92.5%
  • 35. LSI Top 2 PC make new pseudo-terms to define documents… Also, can look at first two Principal components: (0.74,0.49, 0.27,0.28,0.18,0.19) -> emphasizes first two terms (-0.28,-0.24,-0.12,0.74,0.37,0.31) -> separates the two clusters Note how distance from the origin shows number of terms, And angle (from the origin) shows similarity as well
  • 36. LSI • Here we show the same plot, but with two new documents, one with the term “SQL” 50 times, another with the term “Databases” 50 times. • Even though they have no phrases in common, they are close in LSI space
  • 37. Textual analysis • Once we have the data into a nice matrix representation (TD, TDxIDF, or LSI), we can throw the data mining toolbox at it: – Classification of documents • If we have training data for classes – Clustering of documents • unsupervised
  • 38. Automatic document classification • Motivation – Automatic classification for the tremendous number of on-line text documents (Web pages, e-mails, etc.) – Customer comments: Requests for info, complaints, inquiries • A classification problem – Training set: Human experts generate a training data set – Classification: The computer system discovers the classification rules – Application: The discovered rules can be applied to classify new/unknown documents • Techniques – Linear/logistic regression, naïve Bayes – Trees not so good here due to massive dimension, few interactions
  • 39. Naïve Bayes Classifier for Text • Naïve Bayes classifier = conditional independence model – Also called “multivariate Bernoulli” – Assumes conditional independence assumption given the class: p( x | ck ) = Π p( xj | ck ) – Note that we model each term xj as a discrete random variable In other words, the probability that a bunch of words comes from a given class equals the product of the individual probabilities of those words.
  • 40. Multinomial Classifier for Text • Multinomial Classification model – Assumes that the data are generated by a p-sided die (multinomial model): p(x | ck) ∝ p(Nx | ck) ∏j p(term j | ck)^xj – where Nx = number of terms (total count) in document x – xj = number of times term j occurs in the document – ck = class k – Based on training data, each class has its own multinomial probability across all words.
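A minimal multinomial classifier over tokenized documents. Laplace (add-alpha) smoothing is added here so unseen terms do not zero out the product; the function names and the smoothing choice are illustrative, not from the slides:

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit per-class multinomial word distributions with Laplace smoothing.
    docs: list of token lists; labels: parallel list of class names."""
    vocab = {t for d in docs for t in d}
    priors, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)
        counts = Counter(t for d in class_docs for t in d)
        total = sum(counts.values())
        cond[c] = {t: (counts[t] + alpha) / (total + alpha * len(vocab))
                   for t in vocab}
    return priors, cond

def classify(doc, priors, cond):
    """Pick the class maximizing log p(c) + sum_j log p(term_j | c),
    working in log space to avoid underflow on long documents."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(cond[c][t])
                                         for t in doc if t in cond[c])
    return max(priors, key=score)
```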
  • 41. Naïve Bayes vs. Multinomial • Many extensions and adaptations of both • Text mining classification models usually a version of one of these • Example: Web pages – Classify webpages from CS departments into: • student, faculty, course, project – Train on ~5,000 hand-labeled web pages from Cornell, Washington, U.Texas, Wisconsin – Crawl and classify a new site (CMU):

                 Student  Faculty  Person  Project  Course  Department
    Extracted      180       66      246      99      28        1
    Correct        130       28      194      72      25        1
    Accuracy:      72%       42%     79%      73%     89%      100%
  • 43. Highest Probability Terms in Multinomial Distributions Classifying web pages at a University:
  • 44. Document Clustering • Can also do clustering, or unsupervised learning of docs. • Automatically group related documents based on their content. • Require no training sets or predetermined taxonomies. • Major steps – Preprocessing • Remove stop words, stem, feature extraction, lexical analysis, … – Hierarchical clustering • Compute similarities applying clustering algorithms, … – Slicing • Fan out controls, flatten the tree to desired number of levels. • Like all clustering examples, success is relative
  • 45. Document Clustering • To Cluster: – Can use LSI – Another model: Latent Dirichlet Allocation (LDA) – LDA is a generative probabilistic model of a corpus. Documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words. • LDA: – Three concepts: words, topics, and documents – Documents are a collection of words and have a probability distribution over topics – Topics have a probability distribution over words – Fully Bayesian Model
  • 46. LDA • Assume data was generated by a generative process: – θ is a document - made up from topics from a probability distribution • z is a topic made up from words from a probability distribution • w is a word, the only real observable (N = number of words in all documents) • Then, the LDA equations are specified in a fully Bayesian model: α = parameter of the Dirichlet prior on the per-document topic distributions
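The generative story above can be sketched as sampling code. The distributions below are hand-picked for illustration; in real LDA the per-document topic distribution θ and the topic-word distributions are themselves drawn from Dirichlet priors, and inference must recover them from the observed words:

```python
import random

def generate_document(theta, topics, n_words, seed=0):
    """Sketch of the LDA generative process: for each word slot, draw a
    topic z from the document's topic distribution theta, then draw a
    word w from that topic's word distribution."""
    rng = random.Random(seed)
    topic_names = list(topics)
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_names, weights=[theta[t] for t in topic_names])[0]
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append(w)
    return words

doc = generate_document({"databases": 0.9, "stats": 0.1},
                        {"databases": {"sql": 0.5, "index": 0.5},
                         "stats": {"regression": 1.0}},
                        n_words=10)
```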
  • 47. Which can be solved via advanced computational techniques; see Blei et al., 2003
  • 48. LDA output • The result can be an often-useful classification of documents into topics, and a distribution of each topic across words:
  • 49. Another Look at LDA • Model: Topics made up of words used to generate documents
  • 50. Another Look at LDA • Reality: Documents observed, infer topics
  • 51. Case Study: TV Listings • Use text to make recommendations for TV shows
  • 52. Data Issues • 10013|In Harm's Way|In Harm's Way|A tough Naval officer faces the enemy while fighting in the South Pacific during World War II.|A tough Naval officer faces the enemy while fighting in the South Pacific during World War II.|en-US|Movie,NR Rating|Movies:Drama|||165|1965|USA||||||STARS-3||NR|John Wayne, Kirk Douglas, Patricia Neal, Tom Tryon, Paula Prentiss, Burgess Meredith|Otto Preminger||||Otto Preminger| • Parsed Program Guide entries – 2 weeks, ~66,000 programs, 19,000 words • Collapse on series (syndicated shows are still a problem) • Stopwords/stemming, duplication, paid programming, length normalization
  • 53. Data Processing • Combine shows from one series into a ‘canonical’ format
  • 54. Results • We fit LDA – Results in a full distribution of words, topics and documents – Topics are unveiled which are a collection of words
  • 55. Results • For user modelling, consider the collection of shows a single user watches as a ‘document’ – then look to see what topics (and hence, words) make up that document
  • 56.
  • 58. Text Mining: Helpful Data • WordNet Courtesy: Luca Lanzi
  • 59. Text Mining - Other Topics • Part of Speech Tagging – Assign grammatical tags to words (verb, noun, etc) – Helps in understanding documents: uses Hidden Markov Models • Named Entity Classification – Classification task: can we automatically detect proper nouns and tag them – “Mr. Jones” is a person; “Madison” is a town. – Helps with disambiguation: e.g. “spears”
  • 60. Text Mining - Other Topics • Sentiment Analysis – Automatically determine tone in text: positive, negative or neutral – Typically uses collections of good and bad words – “While the traditional media is slowly starting to take John McCain’s straight talking image with increasingly large grains of salt, his base isn’t quite ready to give up on their favorite son. Jonathan Alter’s bizarre defense of McCain after he was caught telling an outright lie, perfectly captures that reluctance[.]” – Often fit using Naïve Bayes • There are sentiment word lists out there: – See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
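A lexicon-counting sketch of this idea; the word lists below are tiny hypothetical samples, whereas real sentiment lists such as those linked above contain thousands of scored words:

```python
# Hypothetical tiny lexicons for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "favorite"}
NEGATIVE = {"bad", "bizarre", "lie", "reluctance", "awful"}

def sentiment(text):
    """Label text positive, negative, or neutral by counting lexicon hits."""
    tokens = [t.strip('.,!?"').lower() for t in text.split()]
    score = (sum(t in POSITIVE for t in tokens)
             - sum(t in NEGATIVE for t in tokens))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

As the slide notes, a Naive Bayes classifier trained on labeled examples usually outperforms raw lexicon counting, but the counting baseline is a common starting point.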
  • 61. Text Mining - Other Topics • Summarizing text: Word Clouds – Takes text as input, finds the most interesting ones, and displays them graphically – Blogs do this – Wordle.net