Prof. Pier Luca Lanzi
Text Mining – Part 2
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Association
Keyword-Based Association Analysis
•  Aims to discover sets of keywords that occur frequently
together in the documents
•  Relies on the usual techniques for mining association
and correlation rules
•  Each document is considered a transaction of type

{document id, {set of keywords}}
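
As a minimal illustration (Python; the documents and the support threshold are made up, and only keyword pairs are counted, i.e., the first level of an Apriori-style search):

from itertools import combinations
from collections import Counter

# each transaction: {document id, {set of keywords}}
transactions = {
    "d1": {"stanford", "university", "research"},
    "d2": {"stanford", "university"},
    "d3": {"dollars", "shares", "exchange"},
    "d4": {"dollars", "shares", "exchange", "university"},
}

min_support = 2  # minimum number of documents that must contain the pair

# count every keyword pair that occurs within a document
pair_counts = Counter()
for keywords in transactions.values():
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: n for p, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)
# e.g., ('stanford', 'university'): 2, ('dollars', 'shares'): 2, ...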
Why Mining Associations?
•  Association mining may discover sets of consecutive or closely
located keywords, called terms or phrases
§ Compound (e.g., {Stanford, University})
§ Noncompound (e.g., {dollars, shares, exchange})
•  Once the most frequent terms have been discovered, term-level
mining can be applied more effectively than single-word-level mining
•  They are useful for improving the accuracy of many NLP tasks
§ POS tagging, parsing, entity recognition, acronym expansion
§ Grammar learning
•  They are directly useful for many applications in text retrieval and
mining
§ Text retrieval (e.g., use word associations to suggest a
variation of a query)
§ Automatic construction of a topic map for browsing: words as
nodes and associations as edges
§ Compare and summarize opinions (e.g., what words are most
strongly associated with “battery” in positive and negative
reviews of the iPhone 6, respectively?)
Basic Word Relations:
Paradigmatic vs. Syntagmatic 
•  Paradigmatic
§ A and B have a paradigmatic relation if they can be substituted for
each other (i.e., A and B are in the same class)
§ For example, “cat” and “dog”, “Monday” and “Tuesday”
•  Syntagmatic
§ A and B have a syntagmatic relation if they can be combined with
each other (i.e., A and B are related semantically)
§ For example, “cat” and “sit”, “car” and “drive”
•  These two basic and complementary relations can be generalized
to describe relations between any items in a language
How to Mine Paradigmatic and
Syntagmatic Relations?
•  Paradigmatic
§ Represent each word by its context
§ Compute context similarity
§ Words with high context similarity likely have a paradigmatic relation
•  Syntagmatic
§ Count how many times two words occur together in a context
(e.g., a sentence or paragraph)
§ Compare their co-occurrences with their individual occurrences
§ Words with high co-occurrence but relatively low individual
occurrences likely have a syntagmatic relation
•  Paradigmatically related words tend to have a syntagmatic relation with
the same word, so the discovery of the two relations can be joined
Mining Paradigmatic Relations
Sim(“cat”, “dog”) =
Sim(Left1(“cat”), Left1(“dog”)) +
Sim(Right1(“cat”), Right1(“dog”)) +
… +
Sim(Window8(“cat”), Window8(“dog”)) = ?

A high value of Sim(word1, word2) implies that
word1 and word2 are paradigmatically related
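
A minimal sketch of the Left1 ingredient of this similarity (Python; toy sentences, raw counts as context weights, and cosine as the similarity function; Right1 and WindowK vectors are built analogously):

import math
from collections import Counter

def left1_context(word, sentences):
    """Counts of the words appearing immediately to the left of `word`."""
    ctx = Counter()
    for tokens in sentences:
        for i, t in enumerate(tokens):
            if t == word and i > 0:
                ctx[tokens[i - 1]] += 1
    return ctx

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

sentences = [s.split() for s in [
    "my cat sat on the mat",
    "my dog sat on the rug",
    "the cat eats fish",
    "the dog eats meat",
]]

# one term of the sum above: Sim(Left1("cat"), Left1("dog"))
print(cosine(left1_context("cat", sentences),
             left1_context("dog", sentences)))  # 1.0 on this toy corpus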
Adapting BM25 to Paradigmatic Relation Mining
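
The idea, roughly, is to weight the words in each context vector with a BM25-style term weight instead of raw counts, so frequent words are dampened and rare words are emphasized. A sketch of the standard Okapi BM25 weight (Python; k = 1.2 and b = 0.75 are conventional defaults, not values from the slide):

import math

def bm25_weight(tf, doc_len, avg_doc_len, df, num_docs, k=1.2, b=0.75):
    """Okapi-BM25-style weight: a sublinear, length-normalized term
    frequency multiplied by an IDF factor."""
    norm_tf = ((k + 1) * tf) / (tf + k * (1 - b + b * doc_len / avg_doc_len))
    idf = math.log((num_docs + 1) / df)
    return norm_tf * idf

# e.g., a context word occurring 3 times in a 10-word context
# (average context length 12), appearing in 100 of 10000 contexts
print(bm25_weight(tf=3, doc_len=10, avg_doc_len=12, df=100, num_docs=10000))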
Mining Syntagmatic Relations
Predicting Words
•  We want to predict whether a word W is present or absent in
a text segment
•  Some words are easier to predict than others
§ W = “the” (easy because frequent)
§ W = “unicorn” (easy because rare)
§ W = “meat” (difficult because neither frequent nor rare)
•  Formal definition
§ The binary random variable Xw is 1 if w is present, 0 otherwise
§ p(Xw=1) + p(Xw=0) = 1
•  The more random Xw is, the more difficult it is to predict
How can we measure the “randomness” of Xw?
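
Entropy answers this question. A minimal sketch for a binary variable (Python; the probabilities are illustrative, not corpus estimates):

import math

def entropy(p):
    """H(Xw) for a binary variable with p = p(Xw = 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.90))    # "the": frequent, low entropy, easy to predict
print(entropy(0.0001))  # "unicorn": rare, low entropy, easy to predict
print(entropy(0.50))    # "meat"-like: maximum entropy, hardest to predict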
Conditional Entropy
•  The entropy of Xw measures the randomness of predicting w
§ The lower the entropy, the easier the prediction
§ The higher the entropy, the more difficult the prediction
•  Suppose now that “eat” is present in the segment and that we need to predict
the presence of “meat”
•  Questions
§ Does the presence of “eat” help predict the presence of “meat”?
§ Does it reduce the uncertainty about “meat”?
§ What if “eat” is not present?
Complete Formula
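
The complete formula is the standard definition of conditional entropy:

H(X|Y) = Σy p(Y = y) H(X | Y = y) = − Σy p(Y = y) Σx p(X = x | Y = y) log2 p(X = x | Y = y)

For syntagmatic mining, X = Xmeat, Y = Xeat, and x, y range over {0, 1}.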
In general, for any discrete random variables X and Y,
we have that H(X) ≥ H(X|Y)

What is the minimum possible value of H(X|Y)?

Which one is smaller? H(Xmeat|Xthe) or H(Xmeat|Xeat)?
Mining Syntagmatic Relations using Entropy
•  For each word W1
§ For every other word W2, compute H(XW1|XW2)
§ Sort all the candidates in ascending order of H(XW1|XW2)
§ Take the top-ranked candidate words as words that have
potential syntagmatic relations with W1
§ A threshold is needed for each W1
•  However, while H(XW1|XW2) and H(XW1|XW3) are comparable,
H(XW1|XW2) and H(XW3|XW2) are not, since they are bounded by the
different entropies H(XW1) and H(XW3) (a sketch of the estimation follows)
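
A minimal sketch of the estimation (Python; presence probabilities are computed from a made-up list of text segments):

import math
from collections import Counter

def H(probs):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(w1, w2, segments):
    """H(Xw1 | Xw2) estimated from presence/absence counts over segments."""
    n = len(segments)
    joint = Counter((w1 in s, w2 in s) for s in segments)
    h = 0.0
    for y in (True, False):
        n_y = joint[(True, y)] + joint[(False, y)]
        if n_y == 0:
            continue
        h += (n_y / n) * H([joint[(x, y)] / n_y for x in (True, False)])
    return h

segments = [set(s.split()) for s in [
    "we eat the meat", "cats eat meat", "dogs eat meat",
    "the sky is blue", "the cat sleeps", "the fish swims",
]]

print(conditional_entropy("meat", "eat", segments))  # 0.0 here: "eat" fully determines "meat"
print(conditional_entropy("meat", "the", segments))  # higher: "the" is far less informative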
Text Classification
Document Classification
•  Solves the problem of automatically labeling text documents on
the basis of
§ Topic
§ Style
§ Purpose
•  Usual classification techniques can be used to learn from a
training set of manually labeled documents
•  Which features? There can be thousands of keywords…
•  Major approaches
§ Similarity-based
§ Dimensionality reduction
§ Naïve Bayes text classifiers
Similarity-based Text Classifiers
•  Exploit Information Retrieval and the k-nearest-neighbor classifier
(a sketch follows this list)
§ For a new document to classify, the k most similar documents
in the training set are retrieved
§ The document is classified on the basis of the class distribution
among the k retrieved documents, using a majority vote or a
weighted vote
§ Tuning k is very important to achieve good performance
•  Limitations
§ Space overhead to store all the documents in the training set
§ Time overhead to retrieve the similar documents
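
A minimal sketch of such a classifier using scikit-learn (an assumed dependency; the documents and labels are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["the team won the match", "stocks fell sharply today",
              "the player scored a goal", "markets rallied after the report"]
train_labels = ["sport", "finance", "sport", "finance"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# retrieve the k most similar training documents by cosine distance
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)

X_new = vectorizer.transform(["the match ended with a late goal"])
print(knn.predict(X_new))  # expected: ['sport']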
Dimensionality Reduction for Text Classification
•  As in the Vector Space Model, the goal is to reduce the number
of features used to represent text
•  Usual dimensionality reduction approaches in Information
Retrieval are based on the distribution of keywords over the
whole document database
•  In text classification it is important to also consider the correlation
between keywords and classes (see the sketch below)
§ Rare keywords have a high TF-IDF but might be uniformly
distributed among classes
§ LSA and LPA do not take class distributions into account
•  Usual classification techniques, such as SVMs and Naïve Bayes
classifiers, can then be applied on the reduced feature space
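
As one concrete example of class-aware feature selection (a chi-square criterion, chosen here for illustration; scikit-learn is an assumed dependency):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the team won the match", "stocks fell sharply today",
        "the player scored a goal", "markets rallied after the report"]
labels = ["sport", "finance", "sport", "finance"]

X = CountVectorizer().fit_transform(docs)

# keep only the k keywords most correlated with the class labels
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)  # (4, 5): the feature space shrinks to 5 keywords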
Naïve Bayes for Text Classification
•  Document categories {C1, …, Cn}
•  Document to classify D
•  Probabilistic model (Bayes' rule):

P(Ci | D) = P(D | Ci) P(Ci) / P(D)

•  We choose the class C* such that C* = argmaxi P(Ci | D)
•  Issues
§ Which features?
§ How to compute the probabilities?
Naïve Bayes for Text Classification
•  Features can be simply defined as the words in the document
•  Let ai be a keyword in the document and wj a word in the vocabulary;
we get:
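
Under the independence assumptions listed below, this factorizes into the standard Naïve Bayes form

P(D | C) = Πi P(ai | C)

where each factor P(ai = wj | C) is estimated from the training documents of class C.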
Example
H={like,dislike}
D= “Our approach to representing arbitrary text documents is disturbingly simple”
•  Assumptions
§ Keyword distributions are inter-independent
§ Keyword distributions are order-independent
•  How to compute the probabilities?
§ Simply counting the occurrences may lead to wrong results when
probabilities are small
•  M-estimate approach adapted for text (the resulting formula is given below):
§ Nc is the total number of word positions in documents of class C
§ Nc,k is the number of occurrences of wk in documents of class C
§ |Vocabulary| is the number of distinct words in the training set
§ Uniform priors are assumed
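
With uniform priors, the m-estimate reduces to the familiar Laplace-smoothed estimate

P(wk | C) = (Nc,k + 1) / (Nc + |Vocabulary|)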
•  Final classification is performed by choosing the class that maximizes
the posterior probability computed with the smoothed estimates
•  Despite its simplicity, the Naïve Bayes classifier works very well in
practice
•  Applications
§ Newsgroup post classification
§ NewsWeeder (news recommender)
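
Putting the pieces together, a minimal end-to-end sketch (Python; toy training data for the hypothetical classes {like, dislike} of the example above; Laplace smoothing as derived above, and log probabilities to avoid underflow):

import math
from collections import Counter, defaultdict

train = [("I like this approach", "like"),
         ("I like simple text models", "like"),
         ("I dislike arbitrary documents", "dislike")]

# collect Nc (word positions per class), Nc,k (word counts per class),
# document counts per class, and the vocabulary
word_counts = defaultdict(Counter)
class_totals = Counter()
doc_counts = Counter()
vocabulary = set()
for doc, c in train:
    doc_counts[c] += 1
    for w in doc.lower().split():
        word_counts[c][w] += 1
        class_totals[c] += 1
        vocabulary.add(w)

def classify(doc):
    total_docs = sum(doc_counts.values())
    best, best_logp = None, -math.inf
    for c in doc_counts:
        # log P(C) + sum_i log P(ai | C), with the smoothed estimates above
        logp = math.log(doc_counts[c] / total_docs)
        for w in doc.lower().split():
            p = (word_counts[c][w] + 1) / (class_totals[c] + len(vocabulary))
            logp += math.log(p)
        if logp > best_logp:
            best, best_logp = c, logp
    return best

print(classify("simple text documents"))  # expected: 'like'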
Clustering Text
Text Clustering
•  It involves the same steps as other text mining algorithms to
preprocess the text data into an adequate representation
§ Tokenizing and stemming each synopsis
§ Transforming the corpus into a vector space using TF-IDF
§ Calculating the cosine distance between each pair of documents as a
measure of similarity
§ Clustering the documents using the similarity measure (a sketch follows)
•  Everything works as with the usual data, except that text requires
rather long preprocessing
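
A minimal sketch of this pipeline using scikit-learn (an assumed dependency; the synopses are made up and stemming is omitted for brevity):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

synopses = ["a detective hunts a serial killer",
            "the detective finally catches the killer",
            "two friends slowly fall in love",
            "they fall in love in paris"]

# TfidfVectorizer L2-normalizes rows by default, so Euclidean k-means on
# these vectors approximates clustering by cosine similarity
X = TfidfVectorizer().fit_transform(synopses)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))  # e.g., [0 0 1 1] (cluster ids may be swapped)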
Weka DEMO
•  Movie Review Data
§ http://www.cs.cornell.edu/people/pabo/movie-review-data/
•  Load Text Data in Weka (CLI; an example command follows)
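
For reference, text files organized in one directory per class (as in the movie review data) can be converted to ARFF with Weka's TextDirectoryLoader; an illustrative invocation (paths are hypothetical):

java weka.core.converters.TextDirectoryLoader -dir movie-review-data > movie-reviews.arff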
Weka DEMO (2)
•  From text to word vectors: StringToWordVector
•  Naïve Bayes Classifier
•  Stoplist
§ https://code.google.com/p/stop-words/
•  Attribute Selection
