SlideShare a Scribd company logo
1 of 54
Download to read offline
Prof. Pier Luca Lanzi
Text Mining – Part 1
Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
Prof. Pier Luca Lanzi(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
Prof. Pier Luca Lanzi(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
Prof. Pier Luca Lanzi(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
Prof. Pier Luca Lanzi
Natural Language Processing (NLP)
Prof. Pier Luca Lanzi
Why Natural Language Processing?
Gov. Schwarzenegger helps inaugurate pricey new bridge
approach
The Associated Press
Article Launched: 04/11/2008 01:40:31 PM PDT
SAN FRANCISCO—It briefly looked like a scene out of a Terminator movie, with Governor
Arnold Schwarzenegger standing in the middle of San Francisco wielding a blow-torch in his
hands. Actually, the governor was just helping to inaugurate a new approach to the San Francisco-
Oakland Bay Bridge.
Caltrans thinks the new approach will make it faster for commuters to get on the bridge from the
San Francisco side.
The new section of the highway is scheduled to open tomorrow morning and cost 429 million
dollars to construct.
• Schwarzenegger
• Bridge
• Caltrans
• Governor
• Scene
• Terminator
•  Does it article talks about entertainment
or politics?
•  Using just the words (bag of words
model) has severe limitations
7
Prof. Pier Luca Lanzi
Natural Language Processing
A dog is chasing a boy on the playground
Det Noun Aux Verb Det Noun Prep Det Noun
Noun Phrase Complex Verb Noun Phrase
Noun Phrase
Prep Phrase
Verb Phrase
Verb Phrase
Sentence
Dog(d1).
Boy(b1).
Playground(p1).
Chasing(d1,b1,p1).
Semantic analysis
Lexical
analysis
(part-of-speech
tagging)
Syntactic analysis
(Parsing)
A person saying this may
be reminding another person to
get the dog back…
Pragmatic analysis
(speech act)
Scared(x) if Chasing(_,x,_).
+
Scared(b1)
Inference
(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
8
Prof. Pier Luca Lanzi
Natural Language Processing is Difficult!
•  Natural language is designed to make human communication
efficient. As a result,
§ We omit a lot of common sense knowledge, which we
assume the hearer/reader possesses.
§ We keep a lot of ambiguities, which we assume the 
hearer/reader knows how to resolve.
•  This makes every step of NLP very difficult
§ Ambiguity
§ Common sense reasoning
9
Prof. Pier Luca Lanzi
What the Difficulties?
•  Word-level Ambiguity
§ “design” can be a verb or a noun
§ “root” has multiple meaning
•  Syntactic Ambiguity
§ Natural language processing
§ A man saw a boy with a telescope
•  Presupposition
§ “He has quit smoking” implies he smoked
•  Text Mining NLP Approach
§ Locate promising fragments using fast methods 
(bag-of-tokens)
§ Only apply slow NLP techniques to promising fragments
10
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
State of the Art?
A dog is chasing a boy on the playground
Det Noun Aux Verb Det Noun Prep Det Noun
Noun Phrase Complex Verb Noun Phrase
Noun Phrase
Prep Phrase
Verb Phrase
Verb Phrase
Sentence
Semantic analysis
(some aspects)
Entity-Relation Extraction
Word sense disambiguation
Sentiment analysis
…
Inference?
POS Tagging
97%
Parsing  90%
(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
12
Speech act?
Prof. Pier Luca Lanzi
What the Open Problems?
•  100% POS tagging
§ “he turned off the highway” vs “he turned off the light”
•  General complete parsing
§ “a man saw a boy with a telescope”
•  Precise deep semantic analysis
•  Robust and general NLP methods tend to be shallow while deep
understanding does not scale up easily
13
Prof. Pier Luca Lanzi
14
Prof. Pier Luca Lanzi
Information Retrieval
Prof. Pier Luca Lanzi
Information Retrieval Systems
•  Information retrieval deals with the problem of locating relevant
documents with respect to the user input or preference
•  Typical systems
§ Online library catalogs
§ Online document management systems
•  Typical issues
§ Management of unstructured documents
§ Approximate search
§ Relevance
16
Prof. Pier Luca Lanzi
Two Modes of Text Access
•  Pull Mode (search engines)
§ Users take initiative
§ Ad hoc information need
•  Push Mode (recommender systems)
§ Systems take initiative
§ Stable information need or system has good knowledge about
a user’s need
17
Prof. Pier Luca Lanzi
2/)( precisionrecall
precisionrecall
Fscore
+
×
=
|}{|
|}{}{|
Relevant
RetrievedRelevant
recall
∩
=
|}{|
|}{}{|
Retrieved
RetrievedRelevant
precision
∩
=
Relevant RR Retrieved
All Documents
Prof. Pier Luca Lanzi
Text Retrieval Methods
•  Document Selection (keyword-based retrieval)
§ Query defines a set of requisites
§ Only the documents that satisfy the query are returned
§ A typical approach is the Boolean Retrieval Model
•  Document Ranking (similarity-based retrieval)
§ Documents are ranked on the basis of their relevance with
respect to the user query
§ For each document a “degree of relevance” is computed with
respect to the query
§ A typical approach is the Vector Space Model
19
Prof. Pier Luca Lanzi
•  A document and a query are represented as vectors in high-
dimensional space corresponding to all the keywords
•  Relevance is measured with an appropriate similarity measure
defined over the vector space
•  Issues
§ How to select keywords to capture “basic concepts” ?
§ How to assign weights to each term?
§ How to measure the similarity?
Vector Space Model 20
Prof. Pier Luca Lanzi
Bag of Words
Prof. Pier Luca Lanzi
Boolean Retrieval Model
•  A query is composed of keywords linked by the three logical
connectives: not, and, or
•  For example, “car and repair”, “plane or airplane”
•  In the Boolean model each document is either relevant or non-
relevant, depending it matches or not the query
•  Limitations
§ Generally not suitable to satisfy information need
§ Useful only in very specific domain where users have a big
expertise
22
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
Java
Microsoft
Starbucks
Query
DOC
d = (0,1,1)
q = (1,1,1)
Prof. Pier Luca Lanzi
Document Ranking
•  Query q = q1, …, qm where qi is a word
•  Document d = d1, …, dn where di is a word (bag of words model)
•  Ranking function f(q, d) which returns a real value
•  A good ranking function should rank relevant documents on top of
non-relevant ones
•  The key challenge is how to measure the likelihood that document d is
relevant to query q
•  The retrieval model gives a formalization of relevance in computational
terms
25
Prof. Pier Luca Lanzi
Retrieval Models
•  Similarity-based models: f(q,d) = similarity(q,d)
§ Vector space model
•  Probabilistic models: f(q,d) = p(R=1|d,q)
§ Classic probabilistic model
§ Language model
§ Divergence-from-randomness model
•  Probabilistic inference model: f(q,d) = p(d-q)
•  Axiomatic model: f(q,d) must satisfy a set of constraints
26
Prof. Pier Luca Lanzi
BasicVector Space Model (VSM)
Prof. Pier Luca Lanzi
How WouldYou RankThese Docs?
Prof. Pier Luca Lanzi
Example with BasicVSM
Prof. Pier Luca Lanzi
BasicVSM returns the number of distinct query words matched in d.
Prof. Pier Luca Lanzi
Two Problems of BasicVSM
Prof. Pier Luca Lanzi
Term Frequency Weighting (TFW)
Prof. Pier Luca Lanzi
Term Frequency Weighting (TFW)
Prof. Pier Luca Lanzi
Ranking usingTFW
Prof. Pier Luca Lanzi
“presidential” vs “about”
Prof. Pier Luca Lanzi
Adding Inverse Document Frequency
Prof. Pier Luca Lanzi
Inverse Document Frequency
Prof. Pier Luca Lanzi
“presidential” vs “about”
Prof. Pier Luca Lanzi
Inverse Document Frequency Ranking
Prof. Pier Luca Lanzi
CombiningTerm Frequency (TF) with
Inverse Document Frequency Ranking
Prof. Pier Luca Lanzi
TFTransformation
Prof. Pier Luca Lanzi
BM25Transformation
Prof. Pier Luca Lanzi
Ranking with BM25-TF
Prof. Pier Luca Lanzi
What about Document Length?
Prof. Pier Luca Lanzi
Document Length Normalization 
•  Penalize a long doc with a doc length normalizer
§ Long doc has a better chance to match any query
§ Need to avoid over-penalization
•  A document is long because
§ it uses more words, then it should be more penalized
§ it has more contents, then it should be penalized less
•  Pivoted length normalizer
§ It uses average doc length as “pivot”
§ The normalizer is 1 if the doc length (|d|) is equal to
the average doc length (avdl)
45
Prof. Pier Luca Lanzi
Pivot Length Normalizer
Prof. Pier Luca Lanzi
State of the Art VSM Ranking Function
•  Pivoted Length Normalization VSM [Singhal et al 96]
•  BM25/Okapi [Robertson  Walker 94]
47
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
Preprocessing
Prof. Pier Luca Lanzi
Keywords Selection
•  Text is preprocessed through tokenization
•  Stop word list and word stemming are used to isolate and thus identify
significant keywords
•  Stop words elimination
§ Stop words are elements that are considered uninteresting with
respect to the retrieval and thus are eliminated
§ For instance, “a”, “the”, “always”, “along”
•  Word stemming
§ Different words that share a common prefix are simplified and
replaced by their common prefix
§ For instance, “computer”, “computing”, “computerize” are replaced
by “comput”
50
Prof. Pier Luca Lanzi
Dimensionality Reduction
•  Approaches presented so far involves high dimensional space
(huge number of keywords)
§ Computationally expensive
§ Difficult to deal with synonymy and polysemy problems
§ “vehicle” is similar to “car”
§ “mining” has different meanings in different contexts
•  Dimensionality reduction techniques
§ Latent Semantic Indexing (LSI)
§ Locality Preserving Indexing (LPI)
§ Probabilistic Semantic Indexing (PLSI)
51
Prof. Pier Luca Lanzi
Latent Semantic Indexing (LSI)
•  Let xi be vectors representing documents and X
(the term frequency matrix) the all set of documents:
•  Let use the singular value decomposition (SVD) to reduce the
size of frequency table:
•  Approximate X with Xk that is obtained from the first k vectors
of U
•  It can be shown that such transformation minimizes the error for
the reconstruction of X
52
Prof. Pier Luca Lanzi
Locality Preserving Analysis (LPA) (or Indexing)
•  The goal is to preserve the locality information
•  Two documents close in the original space should be close also in
the transformed space
•  More formally,
•  Similarity matrix:
Set of Documents Similarity Matrix
Optimal transformation
53
Prof. Pier Luca Lanzi
Probabilistic Latent Semantic Analysis (PLSA)
•  Similar to LSI but does not apply SVD to identify the k most
relevant features
•  The assumption is that all the documents have 
“k common themes”
•  Word distribution in documents can be modeled as
•  Mixing weights are identified with Expectation-Maximization (EM)
algorithms and define new representation of the documents
Theme
distributions
Mixing weights
54

More Related Content

What's hot

DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationPier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningPier Luca Lanzi
 
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsDMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsPier Luca Lanzi
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationPier Luca Lanzi
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationPier Luca Lanzi
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 RegressionPier Luca Lanzi
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsPier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesPier Luca Lanzi
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information RetrievalRoelof Pieters
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsRoelof Pieters
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Saurabh Kaushik
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结君 廖
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationPier Luca Lanzi
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageRoelof Pieters
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationPier Luca Lanzi
 

What's hot (20)

DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsDMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
 

Viewers also liked

DMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification ModelsDMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification ModelsPier Luca Lanzi
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesPier Luca Lanzi
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesPier Luca Lanzi
 
DMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification RulesDMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification RulesPier Luca Lanzi
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringPier Luca Lanzi
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringPier Luca Lanzi
 
Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016Pier Luca Lanzi
 
DMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningDMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningPier Luca Lanzi
 
DMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course IntroductionDMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course IntroductionPier Luca Lanzi
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesPier Luca Lanzi
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesPier Luca Lanzi
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringPier Luca Lanzi
 
Idea Generation and Conceptualization
Idea Generation and ConceptualizationIdea Generation and Conceptualization
Idea Generation and ConceptualizationPier Luca Lanzi
 
Text Mining Using JBoss Rules
Text Mining Using JBoss RulesText Mining Using JBoss Rules
Text Mining Using JBoss RulesMark Maslyn
 
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014Pier Luca Lanzi
 
Working with Formal Elements
Working with Formal ElementsWorking with Formal Elements
Working with Formal ElementsPier Luca Lanzi
 

Viewers also liked (20)

DMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification ModelsDMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification Models
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision Trees
 
DMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification RulesDMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification Rules
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based Clustering
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
Course Introduction
Course IntroductionCourse Introduction
Course Introduction
 
Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016
 
DMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningDMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data Mining
 
DMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course IntroductionDMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course Introduction
 
Course Organization
Course OrganizationCourse Organization
Course Organization
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association Rules
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
 
Idea Generation and Conceptualization
Idea Generation and ConceptualizationIdea Generation and Conceptualization
Idea Generation and Conceptualization
 
Text Mining Using JBoss Rules
Text Mining Using JBoss RulesText Mining Using JBoss Rules
Text Mining Using JBoss Rules
 
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
 
Working with Formal Elements
Working with Formal ElementsWorking with Formal Elements
Working with Formal Elements
 
The Structure of Games
The Structure of GamesThe Structure of Games
The Structure of Games
 
The Design Document
The Design DocumentThe Design Document
The Design Document
 

Similar to DMTM 2015 - 17 Text Mining Part 1

Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4DigiGurukul
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingToine Bogers
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)VenkateshMurugadas
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringPier Luca Lanzi
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Abdullah al Mamun
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹台灣資料科學年會
 
Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2Ron Martinez
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Zachary S. Brown
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdfSoha82
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with LuceneKai Chan
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingYasir Khan
 
Tprs in the chinese classroom
Tprs in the chinese classroomTprs in the chinese classroom
Tprs in the chinese classroomChinese Teachers
 
NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptOlusolaTop
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with LuceneKai Chan
 
Natural language processing
Natural language processingNatural language processing
Natural language processingBasha Chand
 
Natural Language Processing: Lecture 255
Natural Language Processing: Lecture 255Natural Language Processing: Lecture 255
Natural Language Processing: Lecture 255deffa5
 

Similar to DMTM 2015 - 17 Text Mining Part 1 (20)

Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
 
Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdf
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with Lucene
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Tprs in the chinese classroom
Tprs in the chinese classroomTprs in the chinese classroom
Tprs in the chinese classroom
 
NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.ppt
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with Lucene
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing: Lecture 255
Natural Language Processing: Lecture 255Natural Language Processing: Lecture 255
Natural Language Processing: Lecture 255
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i VideogiochiPier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiPier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomePier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaPier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Pier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringPier Luca Lanzi
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringPier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesPier Luca Lanzi
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesPier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationPier Luca Lanzi
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionPier Luca Lanzi
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningPier Luca Lanzi
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelinePier Luca Lanzi
 
VDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityVDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityPier Luca Lanzi
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationPier Luca Lanzi
 

More from Pier Luca Lanzi (19)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 Introduction
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
 
VDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityVDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with Unity
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generation
 

Recently uploaded

URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 

Recently uploaded (20)

URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 

DMTM 2015 - 17 Text Mining Part 1

  • 1. Prof. Pier Luca Lanzi Text Mining – Part 1 Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
  • 3. Prof. Pier Luca Lanzi(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
  • 4. Prof. Pier Luca Lanzi(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
  • 5. Prof. Pier Luca Lanzi(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
  • 6. Prof. Pier Luca Lanzi Natural Language Processing (NLP)
  • 7. Prof. Pier Luca Lanzi Why Natural Language Processing? Gov. Schwarzenegger helps inaugurate pricey new bridge approach The Associated Press Article Launched: 04/11/2008 01:40:31 PM PDT SAN FRANCISCO—It briefly looked like a scene out of a Terminator movie, with Governor Arnold Schwarzenegger standing in the middle of San Francisco wielding a blow-torch in his hands. Actually, the governor was just helping to inaugurate a new approach to the San Francisco- Oakland Bay Bridge. Caltrans thinks the new approach will make it faster for commuters to get on the bridge from the San Francisco side. The new section of the highway is scheduled to open tomorrow morning and cost 429 million dollars to construct. • Schwarzenegger • Bridge • Caltrans • Governor • Scene • Terminator •  Does it article talks about entertainment or politics? •  Using just the words (bag of words model) has severe limitations 7
  • 8. Prof. Pier Luca Lanzi Natural Language Processing A dog is chasing a boy on the playground Det Noun Aux Verb Det Noun Prep Det Noun Noun Phrase Complex Verb Noun Phrase Noun Phrase Prep Phrase Verb Phrase Verb Phrase Sentence Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). Semantic analysis Lexical analysis (part-of-speech tagging) Syntactic analysis (Parsing) A person saying this may be reminding another person to get the dog back… Pragmatic analysis (speech act) Scared(x) if Chasing(_,x,_). + Scared(b1) Inference (Taken from ChengXiang Zhai, CS 397cxz – Fall 2003) 8
  • 9. Prof. Pier Luca Lanzi Natural Language Processing is Difficult! •  Natural language is designed to make human communication efficient. As a result, § We omit a lot of common sense knowledge, which we assume the hearer/reader possesses. § We keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve. •  This makes every step of NLP very difficult § Ambiguity § Common sense reasoning 9
  • 10. Prof. Pier Luca Lanzi What the Difficulties? •  Word-level Ambiguity § “design” can be a verb or a noun § “root” has multiple meaning •  Syntactic Ambiguity § Natural language processing § A man saw a boy with a telescope •  Presupposition § “He has quit smoking” implies he smoked •  Text Mining NLP Approach § Locate promising fragments using fast methods (bag-of-tokens) § Only apply slow NLP techniques to promising fragments 10
  • 12. Prof. Pier Luca Lanzi State of the Art? A dog is chasing a boy on the playground Det Noun Aux Verb Det Noun Prep Det Noun Noun Phrase Complex Verb Noun Phrase Noun Phrase Prep Phrase Verb Phrase Verb Phrase Sentence Semantic analysis (some aspects) Entity-Relation Extraction Word sense disambiguation Sentiment analysis … Inference? POS Tagging 97% Parsing 90% (Taken from ChengXiang Zhai, CS 397cxz – Fall 2003) 12 Speech act?
  • 13. Prof. Pier Luca Lanzi What the Open Problems? •  100% POS tagging § “he turned off the highway” vs “he turned off the light” •  General complete parsing § “a man saw a boy with a telescope” •  Precise deep semantic analysis •  Robust and general NLP methods tend to be shallow while deep understanding does not scale up easily 13
  • 14. Prof. Pier Luca Lanzi 14
  • 15. Prof. Pier Luca Lanzi Information Retrieval
  • 16. Prof. Pier Luca Lanzi Information Retrieval Systems •  Information retrieval deals with the problem of locating relevant documents with respect to the user input or preference •  Typical systems § Online library catalogs § Online document management systems •  Typical issues § Management of unstructured documents § Approximate search § Relevance 16
  • 17. Prof. Pier Luca Lanzi Two Modes of Text Access •  Pull Mode (search engines) § Users take initiative § Ad hoc information need •  Push Mode (recommender systems) § Systems take initiative § Stable information need or system has good knowledge about a user’s need 17
  • 18. Prof. Pier Luca Lanzi 2/)( precisionrecall precisionrecall Fscore + × = |}{| |}{}{| Relevant RetrievedRelevant recall ∩ = |}{| |}{}{| Retrieved RetrievedRelevant precision ∩ = Relevant RR Retrieved All Documents
  • 19. Prof. Pier Luca Lanzi Text Retrieval Methods •  Document Selection (keyword-based retrieval) § Query defines a set of requisites § Only the documents that satisfy the query are returned § A typical approach is the Boolean Retrieval Model •  Document Ranking (similarity-based retrieval) § Documents are ranked on the basis of their relevance with respect to the user query § For each document a “degree of relevance” is computed with respect to the query § A typical approach is the Vector Space Model 19
  • 20. Prof. Pier Luca Lanzi •  A document and a query are represented as vectors in high- dimensional space corresponding to all the keywords •  Relevance is measured with an appropriate similarity measure defined over the vector space •  Issues § How to select keywords to capture “basic concepts” ? § How to assign weights to each term? § How to measure the similarity? Vector Space Model 20
  • 21. Prof. Pier Luca Lanzi Bag of Words
  • 22. Prof. Pier Luca Lanzi Boolean Retrieval Model •  A query is composed of keywords linked by the three logical connectives: not, and, or •  For example, “car and repair”, “plane or airplane” •  In the Boolean model each document is either relevant or non- relevant, depending it matches or not the query •  Limitations § Generally not suitable to satisfy information need § Useful only in very specific domain where users have a big expertise 22
  • 24. Prof. Pier Luca Lanzi Java Microsoft Starbucks Query DOC d = (0,1,1) q = (1,1,1)
  • 25. Prof. Pier Luca Lanzi Document Ranking •  Query q = q1, …, qm where qi is a word •  Document d = d1, …, dn where di is a word (bag of words model) •  Ranking function f(q, d) which returns a real value •  A good ranking function should rank relevant documents on top of non-relevant ones •  The key challenge is how to measure the likelihood that document d is relevant to query q •  The retrieval model gives a formalization of relevance in computational terms 25
  • 26. Prof. Pier Luca Lanzi Retrieval Models •  Similarity-based models: f(q,d) = similarity(q,d) § Vector space model •  Probabilistic models: f(q,d) = p(R=1|d,q) § Classic probabilistic model § Language model § Divergence-from-randomness model •  Probabilistic inference model: f(q,d) = p(d-q) •  Axiomatic model: f(q,d) must satisfy a set of constraints 26
  • 27. Prof. Pier Luca Lanzi BasicVector Space Model (VSM)
  • 28. Prof. Pier Luca Lanzi How WouldYou RankThese Docs?
  • 29. Prof. Pier Luca Lanzi Example with BasicVSM
  • 30. Prof. Pier Luca Lanzi BasicVSM returns the number of distinct query words matched in d.
  • 31. Prof. Pier Luca Lanzi Two Problems of BasicVSM
  • 32. Prof. Pier Luca Lanzi Term Frequency Weighting (TFW)
  • 33. Prof. Pier Luca Lanzi Term Frequency Weighting (TFW)
  • 34. Prof. Pier Luca Lanzi Ranking usingTFW
  • 35. Prof. Pier Luca Lanzi “presidential” vs “about”
  • 36. Prof. Pier Luca Lanzi Adding Inverse Document Frequency
  • 37. Prof. Pier Luca Lanzi Inverse Document Frequency
  • 38. Prof. Pier Luca Lanzi “presidential” vs “about”
  • 39. Prof. Pier Luca Lanzi Inverse Document Frequency Ranking
  • 40. Prof. Pier Luca Lanzi CombiningTerm Frequency (TF) with Inverse Document Frequency Ranking
  • 41. Prof. Pier Luca Lanzi TFTransformation
  • 42. Prof. Pier Luca Lanzi BM25Transformation
  • 43. Prof. Pier Luca Lanzi Ranking with BM25-TF
  • 44. Prof. Pier Luca Lanzi What about Document Length?
  • 45. Prof. Pier Luca Lanzi Document Length Normalization •  Penalize a long doc with a doc length normalizer § Long doc has a better chance to match any query § Need to avoid over-penalization •  A document is long because § it uses more words, then it should be more penalized § it has more contents, then it should be penalized less •  Pivoted length normalizer § It uses average doc length as “pivot” § The normalizer is 1 if the doc length (|d|) is equal to the average doc length (avdl) 45
  • 46. Prof. Pier Luca Lanzi Pivot Length Normalizer
  • 47. Prof. Pier Luca Lanzi State of the Art VSM Ranking Function •  Pivoted Length Normalization VSM [Singhal et al 96] •  BM25/Okapi [Robertson Walker 94] 47
  • 49. Prof. Pier Luca Lanzi Preprocessing
  • 50. Prof. Pier Luca Lanzi Keywords Selection •  Text is preprocessed through tokenization •  Stop word list and word stemming are used to isolate and thus identify significant keywords •  Stop words elimination § Stop words are elements that are considered uninteresting with respect to the retrieval and thus are eliminated § For instance, “a”, “the”, “always”, “along” •  Word stemming § Different words that share a common prefix are simplified and replaced by their common prefix § For instance, “computer”, “computing”, “computerize” are replaced by “comput” 50
  • 51. Prof. Pier Luca Lanzi Dimensionality Reduction •  Approaches presented so far involves high dimensional space (huge number of keywords) § Computationally expensive § Difficult to deal with synonymy and polysemy problems § “vehicle” is similar to “car” § “mining” has different meanings in different contexts •  Dimensionality reduction techniques § Latent Semantic Indexing (LSI) § Locality Preserving Indexing (LPI) § Probabilistic Semantic Indexing (PLSI) 51
  • 52. Prof. Pier Luca Lanzi Latent Semantic Indexing (LSI) •  Let xi be vectors representing documents and X (the term frequency matrix) the all set of documents: •  Let use the singular value decomposition (SVD) to reduce the size of frequency table: •  Approximate X with Xk that is obtained from the first k vectors of U •  It can be shown that such transformation minimizes the error for the reconstruction of X 52
  • 53. Prof. Pier Luca Lanzi Locality Preserving Analysis (LPA) (or Indexing) •  The goal is to preserve the locality information •  Two documents close in the original space should be close also in the transformed space •  More formally, •  Similarity matrix: Set of Documents Similarity Matrix Optimal transformation 53
  • 54. Prof. Pier Luca Lanzi Probabilistic Latent Semantic Analysis (PLSA) •  Similar to LSI but does not apply SVD to identify the k most relevant features •  The assumption is that all the documents have “k common themes” •  Word distribution in documents can be modeled as •  Mixing weights are identified with Expectation-Maximization (EM) algorithms and define new representation of the documents Theme distributions Mixing weights 54