Prof. Pier Luca Lanzi
Text Mining – Part 2
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Association
Keyword-Based Association Analysis
•  Aims to discover sets of keywords that occur frequently
together in the documents
•  Relies on the usual techniques for mining association
and correlation rules
•  Each document is considered a transaction of type

{document id, {set of keywords}}
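
As a minimal illustration (Python; the documents and the support threshold are made up, and only keyword pairs are counted, i.e., the first level of an Apriori-style search):

from itertools import combinations
from collections import Counter

# each transaction: {document id, {set of keywords}}
transactions = {
    "d1": {"stanford", "university", "research"},
    "d2": {"stanford", "university"},
    "d3": {"dollars", "shares", "exchange"},
    "d4": {"dollars", "shares", "exchange", "university"},
}

min_support = 2  # minimum number of documents that must contain the pair

# count every keyword pair that occurs within a document
pair_counts = Counter()
for keywords in transactions.values():
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: n for p, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)
# e.g., ('stanford', 'university'): 2, ('dollars', 'shares'): 2, ...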
Why Mining Associations?
•  Association mining may discover sets of consecutive or closely
located keywords, called terms or phrases
§ Compound (e.g., {Stanford, University})
§ Noncompound (e.g., {dollars, shares, exchange})
•  Once the most frequent terms have been discovered, term-level
mining can be applied more effectively than single-word-level mining
•  They are useful for improving the accuracy of many NLP tasks
§ POS tagging, parsing, entity recognition, acronym expansion
§ Grammar learning
•  They are directly useful for many applications in text retrieval and
mining
§ Text retrieval (e.g., use word associations to suggest a
variation of a query)
§ Automatic construction of a topic map for browsing: words as
nodes and associations as edges
§ Compare and summarize opinions (e.g., what words are most
strongly associated with “battery” in positive and negative
reviews of the iPhone 6, respectively?)
Basic Word Relations:
Paradigmatic vs. Syntagmatic 
•  Paradigmatic
§ A and B have a paradigmatic relation if they can be substituted for
each other (i.e., A and B are in the same class)
§ For example, “cat” and “dog”, “Monday” and “Tuesday”
•  Syntagmatic
§ A and B have a syntagmatic relation if they can be combined with
each other (i.e., A and B are related semantically)
§ For example, “cat” and “sit”, “car” and “drive”
•  These two basic and complementary relations can be generalized
to describe relations between any items in a language
How to Mine Paradigmatic and
Syntagmatic Relations?
•  Paradigmatic
§ Represent each word by its context
§ Compute context similarity
§ Words with high context similarity likely have a paradigmatic relation
•  Syntagmatic
§ Count how many times two words occur together in a context
(e.g., a sentence or paragraph)
§ Compare their co-occurrences with their individual occurrences
§ Words with high co-occurrence but relatively low individual
occurrences likely have a syntagmatic relation
•  Paradigmatically related words tend to have a syntagmatic relation with
the same word, so the discovery of the two relations can be joined
Mining Paradigmatic Relations
Sim(“cat”, “dog”) =
Sim(Left1(“cat”), Left1(“dog”)) +
Sim(Right1(“cat”), Right1(“dog”)) +
… +
Sim(Window8(“cat”), Window8(“dog”)) = ?

A high value of Sim(word1, word2) implies that
word1 and word2 are paradigmatically related
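
A minimal sketch of the Left1 ingredient of this similarity (Python; toy sentences, raw counts as context weights, and cosine as the similarity function; Right1 and WindowK vectors are built analogously):

import math
from collections import Counter

def left1_context(word, sentences):
    """Counts of the words appearing immediately to the left of `word`."""
    ctx = Counter()
    for tokens in sentences:
        for i, t in enumerate(tokens):
            if t == word and i > 0:
                ctx[tokens[i - 1]] += 1
    return ctx

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

sentences = [s.split() for s in [
    "my cat sat on the mat",
    "my dog sat on the rug",
    "the cat eats fish",
    "the dog eats meat",
]]

# one term of the sum above: Sim(Left1("cat"), Left1("dog"))
print(cosine(left1_context("cat", sentences),
             left1_context("dog", sentences)))  # 1.0 on this toy corpus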
Adapting BM25 to Paradigmatic Relation Mining
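
The idea, roughly, is to weight the words in each context vector with a BM25-style term weight instead of raw counts, so frequent words are dampened and rare words are emphasized. A sketch of the standard Okapi BM25 weight (Python; k = 1.2 and b = 0.75 are conventional defaults, not values from the slide):

import math

def bm25_weight(tf, doc_len, avg_doc_len, df, num_docs, k=1.2, b=0.75):
    """Okapi-BM25-style weight: a sublinear, length-normalized term
    frequency multiplied by an IDF factor."""
    norm_tf = ((k + 1) * tf) / (tf + k * (1 - b + b * doc_len / avg_doc_len))
    idf = math.log((num_docs + 1) / df)
    return norm_tf * idf

# e.g., a context word occurring 3 times in a 10-word context
# (average context length 12), appearing in 100 of 10000 contexts
print(bm25_weight(tf=3, doc_len=10, avg_doc_len=12, df=100, num_docs=10000))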
Mining Syntagmatic Relations
Predicting Words
•  We want to predict whether a word W is present or absent in
a text segment
•  Some words are easier to predict than others
§ W = “the” (easy because frequent)
§ W = “unicorn” (easy because rare)
§ W = “meat” (difficult because neither frequent nor rare)
•  Formal definition
§ The binary random variable Xw is 1 if w is present, 0 otherwise
§ p(Xw=1) + p(Xw=0) = 1
•  The more random Xw is, the more difficult it is to predict
How can we measure the “randomness” of Xw?
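
Entropy answers this question. A minimal sketch for a binary variable (Python; the probabilities are illustrative, not corpus estimates):

import math

def entropy(p):
    """H(Xw) for a binary variable with p = p(Xw = 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.90))    # "the": frequent, low entropy, easy to predict
print(entropy(0.0001))  # "unicorn": rare, low entropy, easy to predict
print(entropy(0.50))    # "meat"-like: maximum entropy, hardest to predict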
Conditional Entropy
•  The entropy of Xw measures the randomness of predicting w
§ The lower the entropy, the easier the prediction
§ The higher the entropy, the more difficult the prediction
•  Suppose now that “eat” is present in the segment and that we need to predict
the presence of “meat”
•  Questions
§ Does the presence of “eat” help predict the presence of “meat”?
§ Does it reduce the uncertainty about “meat”?
§ What if “eat” is not present?
Complete Formula
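
The complete formula is the standard definition of conditional entropy:

H(X|Y) = Σy p(Y = y) H(X | Y = y) = − Σy p(Y = y) Σx p(X = x | Y = y) log2 p(X = x | Y = y)

For syntagmatic mining, X = Xmeat, Y = Xeat, and x, y range over {0, 1}.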
In general, for any discrete random variables X and Y,
we have that H(X) ≥ H(X|Y)

What is the minimum possible value of H(X|Y)?

Which one is smaller? H(Xmeat|Xthe) or H(Xmeat|Xeat)?
Mining Syntagmatic Relations using Entropy
•  For each word W1
§ For every other word W2, compute H(XW1|XW2)
§ Sort all the candidates in ascending order of H(XW1|XW2)
§ Take the top-ranked candidate words as words that have
potential syntagmatic relations with W1
§ A threshold is needed for each W1
•  However, while H(XW1|XW2) and H(XW1|XW3) are comparable,
H(XW1|XW2) and H(XW3|XW2) are not, since they are bounded by the
different entropies H(XW1) and H(XW3) (a sketch of the estimation follows)
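
A minimal sketch of the estimation (Python; presence probabilities are computed from a made-up list of text segments):

import math
from collections import Counter

def H(probs):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(w1, w2, segments):
    """H(Xw1 | Xw2) estimated from presence/absence counts over segments."""
    n = len(segments)
    joint = Counter((w1 in s, w2 in s) for s in segments)
    h = 0.0
    for y in (True, False):
        n_y = joint[(True, y)] + joint[(False, y)]
        if n_y == 0:
            continue
        h += (n_y / n) * H([joint[(x, y)] / n_y for x in (True, False)])
    return h

segments = [set(s.split()) for s in [
    "we eat the meat", "cats eat meat", "dogs eat meat",
    "the sky is blue", "the cat sleeps", "the fish swims",
]]

print(conditional_entropy("meat", "eat", segments))  # 0.0 here: "eat" fully determines "meat"
print(conditional_entropy("meat", "the", segments))  # higher: "the" is far less informative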
Text Classification
Document Classification
•  Solves the problem of automatically labeling text documents on
the basis of
§ Topic
§ Style
§ Purpose
•  Usual classification techniques can be used to learn from a
training set of manually labeled documents
•  Which features? There can be thousands of keywords…
•  Major approaches
§ Similarity-based
§ Dimensionality reduction
§ Naïve Bayes text classifiers
Similarity-based Text Classifiers
•  Exploit Information Retrieval and the k-nearest-neighbor classifier
(a sketch follows this list)
§ For a new document to classify, the k most similar documents
in the training set are retrieved
§ The document is classified on the basis of the class distribution
among the k retrieved documents, using a majority vote or a
weighted vote
§ Tuning k is very important to achieve good performance
•  Limitations
§ Space overhead to store all the documents in the training set
§ Time overhead to retrieve the similar documents
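
A minimal sketch of such a classifier using scikit-learn (an assumed dependency; the documents and labels are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["the team won the match", "stocks fell sharply today",
              "the player scored a goal", "markets rallied after the report"]
train_labels = ["sport", "finance", "sport", "finance"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# retrieve the k most similar training documents by cosine distance
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)

X_new = vectorizer.transform(["the match ended with a late goal"])
print(knn.predict(X_new))  # expected: ['sport']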
Dimensionality Reduction for Text Classification
•  As in the Vector Space Model, the goal is to reduce the number
of features used to represent text
•  Usual dimensionality reduction approaches in Information
Retrieval are based on the distribution of keywords over the
whole document database
•  In text classification it is important to also consider the correlation
between keywords and classes (see the sketch below)
§ Rare keywords have a high TF-IDF but might be uniformly
distributed among classes
§ LSA and LPA do not take class distributions into account
•  Usual classification techniques, such as SVMs and Naïve Bayes
classifiers, can then be applied on the reduced feature space
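
As one concrete example of class-aware feature selection (a chi-square criterion, chosen here for illustration; scikit-learn is an assumed dependency):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the team won the match", "stocks fell sharply today",
        "the player scored a goal", "markets rallied after the report"]
labels = ["sport", "finance", "sport", "finance"]

X = CountVectorizer().fit_transform(docs)

# keep only the k keywords most correlated with the class labels
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)  # (4, 5): the feature space shrinks to 5 keywords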
Naïve Bayes for Text Classification
•  Document categories {C1, …, Cn}
•  Document to classify D
•  Probabilistic model (Bayes' rule):

P(Ci | D) = P(D | Ci) P(Ci) / P(D)

•  We choose the class C* such that C* = argmaxi P(Ci | D)
•  Issues
§ Which features?
§ How to compute the probabilities?
Naïve Bayes for Text Classification
•  Features can be simply defined as the words in the document
•  Let ai be a keyword in the document and wj a word in the vocabulary;
we get:
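
Under the independence assumptions listed below, this factorizes into the standard Naïve Bayes form

P(D | C) = Πi P(ai | C)

where each factor P(ai = wj | C) is estimated from the training documents of class C.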
Example
H={like,dislike}
D= “Our approach to representing arbitrary text documents is disturbingly simple”
•  Assumptions
§ Keyword distributions are inter-independent
§ Keyword distributions are order-independent
•  How to compute the probabilities?
§ Simply counting the occurrences may lead to wrong results when
probabilities are small
•  M-estimate approach adapted for text (the resulting formula is given below):
§ Nc is the total number of word positions in documents of class C
§ Nc,k is the number of occurrences of wk in documents of class C
§ |Vocabulary| is the number of distinct words in the training set
§ Uniform priors are assumed
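
With uniform priors, the m-estimate reduces to the familiar Laplace-smoothed estimate

P(wk | C) = (Nc,k + 1) / (Nc + |Vocabulary|)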
•  Final classification is performed by choosing the class that maximizes
the posterior probability computed with the smoothed estimates
•  Despite its simplicity, the Naïve Bayes classifier works very well in
practice
•  Applications
§ Newsgroup post classification
§ NewsWeeder (news recommender)
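
Putting the pieces together, a minimal end-to-end sketch (Python; toy training data for the hypothetical classes {like, dislike} of the example above; Laplace smoothing as derived above, and log probabilities to avoid underflow):

import math
from collections import Counter, defaultdict

train = [("I like this approach", "like"),
         ("I like simple text models", "like"),
         ("I dislike arbitrary documents", "dislike")]

# collect Nc (word positions per class), Nc,k (word counts per class),
# document counts per class, and the vocabulary
word_counts = defaultdict(Counter)
class_totals = Counter()
doc_counts = Counter()
vocabulary = set()
for doc, c in train:
    doc_counts[c] += 1
    for w in doc.lower().split():
        word_counts[c][w] += 1
        class_totals[c] += 1
        vocabulary.add(w)

def classify(doc):
    total_docs = sum(doc_counts.values())
    best, best_logp = None, -math.inf
    for c in doc_counts:
        # log P(C) + sum_i log P(ai | C), with the smoothed estimates above
        logp = math.log(doc_counts[c] / total_docs)
        for w in doc.lower().split():
            p = (word_counts[c][w] + 1) / (class_totals[c] + len(vocabulary))
            logp += math.log(p)
        if logp > best_logp:
            best, best_logp = c, logp
    return best

print(classify("simple text documents"))  # expected: 'like'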
Clustering Text
Text Clustering
•  It involves the same steps as other text mining algorithms to
preprocess the text data into an adequate representation
§ Tokenizing and stemming each synopsis
§ Transforming the corpus into a vector space using TF-IDF
§ Calculating the cosine distance between each pair of documents as a
measure of similarity
§ Clustering the documents using the similarity measure (a sketch follows)
•  Everything works as with the usual data, except that text requires
rather long preprocessing
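
A minimal sketch of this pipeline using scikit-learn (an assumed dependency; the synopses are made up and stemming is omitted for brevity):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

synopses = ["a detective hunts a serial killer",
            "the detective finally catches the killer",
            "two friends slowly fall in love",
            "they fall in love in paris"]

# TfidfVectorizer L2-normalizes rows by default, so Euclidean k-means on
# these vectors approximates clustering by cosine similarity
X = TfidfVectorizer().fit_transform(synopses)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))  # e.g., [0 0 1 1] (cluster ids may be swapped)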
Weka DEMO
•  Movie Review Data
§ http://www.cs.cornell.edu/people/pabo/movie-review-data/
•  Load Text Data in Weka (CLI; an example command follows)
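
For reference, text files organized in one directory per class (as in the movie review data) can be converted to ARFF with Weka's TextDirectoryLoader; an illustrative invocation (paths are hypothetical):

java weka.core.converters.TextDirectoryLoader -dir movie-review-data > movie-reviews.arff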
Weka DEMO (2)
•  From text to word vectors: StringToWordVector
•  Naïve Bayes Classifier
•  Stoplist
§ https://code.google.com/p/stop-words/
•  Attribute Selection
