Phrase Mining and Topic
Modeling from Large Corpora
JIAWEI HAN
COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
JUNE 16, 2016
Why Phrase Mining?
 Unigrams vs. phrases
 Unigrams (single words) are ambiguous
Example: “United”: United States? United Airlines? United Parcel Service?
 Phrase: A natural, meaningful, unambiguous semantic unit
Example: “United States” vs. “United Airlines”
 Mining semantically meaningful phrases
 Transform text data from word granularity to phrase granularity
 Enhance the power and efficiency of manipulating unstructured data using
database technology
Mining Phrases: Why Not Use NLP Methods?
 Phrase mining: Originated from the NLP community—“Chunking”
 Model it as a sequence labeling problem (B-NP, I-NP, O, …)
 Need annotation and training
 Annotate hundreds of documents as training data
 Train a supervised model based on part-of-speech features
 Recent trend:
 Use distributional features based on web n-grams (Bergsma et al., 2010)
 State-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score
 Limitations
 High annotation cost, not scalable to a new language, a new domain or genre
 May not fit domain-specific, dynamic, emerging applications
 Scientific domains, query logs, or social media, e.g., Yelp, Twitter
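To make the sequence-labeling formulation concrete, here is a minimal sketch (not from the tutorial) of how B-NP/I-NP/O tags decode into noun-phrase chunks; the sentence and tags are hypothetical, and a real chunker would predict the tags with a trained model.

```python
def decode_np_chunks(tokens, tags):
    """Collect noun-phrase chunks from BIO tags (B-NP starts a chunk,
    I-NP continues it, O is outside any chunk)."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-NP":                    # a new chunk begins
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag == "I-NP" and current:      # extend the open chunk
            current.append(token)
        else:                                # O (or stray I-NP): close any open chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["United", "Parcel", "Service", "delivers", "heavy", "packages"]
tags   = ["B-NP",   "I-NP",   "I-NP",    "O",        "B-NP",  "I-NP"]
print(decode_np_chunks(tokens, tags))  # ['United Parcel Service', 'heavy packages']
```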
Data Mining Approaches
 General principle: Fully exploit information redundancy and data-driven
criteria to determine phrase boundaries and salience
 Phrase Mining and Topic Modeling from Large Corpora
 Strategy 1: Simultaneously Inferring Phrases and Topics
 Strategy 2: Post Topic Modeling Phrase Construction
 Strategy 3: First Phrase Mining then Topic Modeling (ToPMine)
 Integration of Phrase Mining with Document Segmentation
Session 1. Frequent Pattern
Mining for Text Data
Frequent Pattern Mining for Text Data: Phrase
Mining and Topic Modeling
 Motivation: Unigrams (single words) can be difficult to interpret
 Ex.: The topic that represents the area of Machine Learning
 Unigram topic (hard to interpret): learning, reinforcement, support, machine, vector, selection, feature, random, …
 versus
 Phrase topic: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …
Various Strategies: Phrase-Based Topic Modeling
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Bigram topical model [Wallach’06], topical n-gram model [Wang, et al.’07],
phrase discovering topic model [Lindsey, et al.’12]
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Label topic [Mei et al.’07], TurboTopics [Blei & Lafferty’09], KERT [Danilevsky, et
al.’14]
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose on
the bag-of-words model
 ToPMine [El-Kishky, et al.’15]
Session 2. Strategy 1:
Simultaneously Inferring
Phrases and Topics
Strategy 1: Simultaneously Inferring Phrases and Topics
 Bigram Topic Model [Wallach’06]
 Probabilistic generative model that conditions on previous word and topic when
drawing next word
 Topical N-Grams (TNG) [Wang, et al.’07]
 Probabilistic model that generates words in textual order
 Create n-grams by concatenating successive bigrams (a generalization of the
Bigram Topic Model)
 Phrase-Discovering LDA (PDLDA) [Lindsey, et al.’12]
 Viewing each sentence as a time-series of words, PDLDA posits that the
generative parameter (topic) changes periodically
 Each word is drawn based on previous m words (context) and current phrase topic
 High model complexity: tends to overfit; high inference cost: slow
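As a rough illustration of Strategy 1's core idea, conditioning each token on both its topic and the previous word, here is a toy generative simulation in the spirit of the Bigram Topic Model; the vocabulary size, topic count, and Dirichlet hyperparameters are invented for the sketch and are not from Wallach'06.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, doc_len = 6, 2, 10                     # toy vocabulary size, topics, doc length

theta = rng.dirichlet(np.ones(K))            # the document's topic mixture
# One word distribution per (topic, previous word): phi[z][w_prev] is over V words
phi = rng.dirichlet(np.ones(V), size=(K, V))

words, prev = [], 0                          # word id 0 doubles as a start symbol
for _ in range(doc_len):
    z = rng.choice(K, p=theta)               # draw the token's topic
    w = rng.choice(V, p=phi[z, prev])        # draw word given topic AND previous word
    words.append(w)
    prev = w
print(words)
```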
TNG: Experiments on Research Papers
TNG: Experiments on Research Papers (2)
Session 3. Strategy 2: Post
Topic Modeling Phrase
Construction
Strategy 2: Post Topic Modeling Phrase Construction
 TurboTopics [Blei & Lafferty’09] – Phrase construction as a post-processing step
to Latent Dirichlet Allocation
 Perform Latent Dirichlet Allocation on corpus to assign each token a topic label
 Merge adjacent unigrams with the same topic label by a distribution-free
permutation test on an arbitrary-length back-off model
 End recursive merging when all significant adjacent unigrams have been
merged
 KERT [Danilevsky et al.’14] – Phrase construction as a post-processing step
to Latent Dirichlet Allocation
 Perform frequent pattern mining on each topic
 Perform phrase ranking based on four different criteria
Example of TurboTopics
 Perform LDA on corpus to assign each token a topic label
 E.g., … phase[11] transition[11] … game[153] theory[127] … (numbers are topic labels)
 Then merge adjacent unigrams with same topic label
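A minimal sketch of that merging pass, with a hypothetical count threshold standing in for TurboTopics' distribution-free permutation test:

```python
from collections import Counter

def merge_same_topic(tokens, topics, bigram_counts, min_count=5):
    """One left-to-right merging pass: join adjacent tokens that carry the
    same topic label if the merged bigram is frequent enough in the corpus.
    The threshold test is a toy stand-in for TurboTopics' permutation test."""
    merged, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens) and topics[i] == topics[i + 1]
                and bigram_counts[(tokens[i], tokens[i + 1])] >= min_count):
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

counts = Counter({("phase", "transition"): 42, ("game", "theory"): 3})
print(merge_same_topic(["phase", "transition", "game", "theory"],
                       [11, 11, 153, 127], counts))
# ['phase transition', 'game', 'theory'] -- 'game'/'theory' differ in topic anyway
```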
KERT: Topical Keyphrase Extraction & Ranking
[Danilevsky, et al. 2014]
Input: paper titles with unigram topic assignments (Topic 1 & Topic 2), e.g.:
 knowledge discovery using least squares support vector machine classifiers
 support vectors for reinforcement learning
 a hybrid approach to feature selection
 pseudo conditional random fields
 automatic web page classification in a dynamic and hierarchical way
 inverse time dependency in convex regularized learning
 postprocessing decision trees to extract actionable knowledge
 variance minimization least squares support vector machines
 …
Output of topical keyphrase extraction & ranking:
 learning
 support vector machines
 reinforcement learning
 feature selection
 conditional random fields
 classification
 decision trees
 …
Framework of KERT
1. Run bag-of-words model inference and assign a topic label to each token
2. Extract candidate keyphrases within each topic (via frequent pattern mining)
3. Rank the keyphrases in each topic
 Popularity: ‘information retrieval’ vs. ‘cross-language information retrieval’
 Discriminativeness: only frequent in documents about topic t
 Concordance: ‘active learning’ vs. ‘learning classification’
 Completeness: ‘vector machine’ vs. ‘support vector machine’
 Comparability property: directly compare phrases of mixed lengths
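A minimal sketch of step 2, assuming a topic's documents have already been collected: contiguous n-grams meeting a support threshold become candidate keyphrases (the titles, min_support, and max_len below are toy values).

```python
from collections import Counter

def candidate_keyphrases(docs, min_support=2, max_len=4):
    """Count contiguous n-grams (n >= 1) in one topic's documents and keep
    those meeting min_support as candidate keyphrases."""
    counts = Counter()
    for doc in docs:
        words = doc.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

topic_docs = ["support vector machines for text",
              "least squares support vector machines"]
print(candidate_keyphrases(topic_docs))
# e.g. {'support': 2, 'vector': 2, ..., 'support vector machines': 2}
```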
KERT: Topical Phrases on Machine Learning (Top-Ranked Phrases by Mining Paper Titles in DBLP)
Comparison of phrase ranking methods on the topic that represents the area of Machine Learning:

| kpRel [Zhao et al. 11] | KERT (-popularity) | KERT (-discriminativeness) | KERT (-concordance) | KERT [Danilevsky et al. 14] |
| learning | effective | support vector machines | learning | learning |
| classification | text | feature selection | classification | support vector machines |
| selection | probabilistic | reinforcement learning | selection | reinforcement learning |
| models | identification | conditional random fields | feature | feature selection |
| algorithm | mapping | constraint satisfaction | decision | conditional random fields |
| features | task | decision trees | bayesian | classification |
| decision | planning | dimensionality reduction | trees | decision trees |
| : | : | : | : | : |
Session 4. Strategy 3: First
Phrase Mining then Topic
Modeling
Strategy 3: First Phrase Mining then Topic Modeling
 ToPMine [El-Kishky et al. VLDB’15]
 First phrase construction, then topic mining
 Contrast with KERT: topic modeling, then phrase mining
 The ToPMine Framework:
 Perform frequent contiguous pattern mining to extract candidate phrases and
their counts
 Perform agglomerative merging of adjacent unigrams as guided by a significance
score—This segments each document into a “bag-of-phrases”
 The newly formed bag-of-phrases is passed as input to PhraseLDA, an extension
of LDA that constrains all words in a phrase to share the same latent topic
Why First Phrase Mining then Topic Modeling ?
 With Strategy 2, tokens in the same phrase may be assigned to different topics
 Ex. knowledge discovery using least squares support vector machine classifiers…
 Knowledge discovery and support vector machine should have coherent topic labels
 Solution: switch the order of phrase mining and topic model inference
 Techniques
 Phrase mining and document segmentation
 Topic model inference with phrase constraint
[knowledge discovery] using [least squares] [support vector machine] [classifiers] …
Phrase Mining: Frequent Pattern Mining + Statistical
Analysis
[Markov blanket] [feature selection] for [support vector
machines]
[knowledge discovery] using [least squares] [support
vector machine] [classifiers]
…[support vector] for [machine learning]…
| Phrase | Raw freq. | True freq. |
| [support vector machine] | 90 | 80 |
| [vector machine] | 95 | 0 |
| [support vector] | 100 | 20 |

Quality phrases are identified based on the significance score [Church et al.’91]:
α(P1, P2) ≈ (f(P1⊕P2) − μ0(P1, P2)) / √f(P1⊕P2)
where f(P1⊕P2) is the observed frequency of the concatenated phrase and μ0(P1, P2) is its expected frequency if P1 and P2 occurred independently.
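A minimal sketch of how such a significance score can guide agglomerative merging: repeatedly join the adjacent pair of segments with the highest α until no pair clears a threshold. The count table, corpus size N, and threshold are toy stand-ins; a real implementation computes them from the whole corpus.

```python
import math

# Toy corpus statistics (hypothetical): raw counts of candidate phrases
freq = {"support": 100, "vector": 120, "machine": 90,
        "support vector": 80, "vector machine": 5,
        "support vector machine": 4}
N = 10_000  # total token positions in the corpus (made up)

def significance(p1, p2):
    """alpha(P1,P2) ~ (f(P1+P2) - mu0(P1,P2)) / sqrt(f(P1+P2)), where mu0 is
    the count expected if P1 and P2 occurred independently."""
    f = freq.get(p1 + " " + p2, 0)
    if f == 0:
        return float("-inf")
    mu0 = N * (freq[p1] / N) * (freq[p2] / N)
    return (f - mu0) / math.sqrt(f)

def segment(words, threshold=1.0):
    segs = list(words)
    while len(segs) > 1:
        scores = [significance(segs[i], segs[i + 1]) for i in range(len(segs) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break                              # no significant merge left
        segs[best:best + 2] = [segs[best] + " " + segs[best + 1]]
    return segs

print(segment(["support", "vector", "machine"]))  # ['support vector machine']
```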
Collocation Mining
 Collocation: A sequence of words that co-occur more frequently than expected by chance
 Often “interesting” and, due to their non-compositionality, often convey information
not expressed by their constituent terms (e.g., “made an exception”, “strong tea”)
 Many different measures are used to extract collocations from a corpus [Dunning 93,
Pedersen 96]
 E.g., mutual information, t-test, z-test, chi-squared test, likelihood ratio
 Many of these measures can be used to guide the agglomerative
phrase-segmentation algorithm
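For instance, pointwise mutual information is straightforward to compute from corpus counts (the counts below are toy numbers):

```python
import math

def pmi(bigram_count, w1_count, w2_count, total_tokens):
    """PMI(w1, w2) = log[ p(w1, w2) / (p(w1) p(w2)) ]; positive values mean
    the pair co-occurs more often than independence would predict."""
    p_xy = bigram_count / total_tokens
    p_x, p_y = w1_count / total_tokens, w2_count / total_tokens
    return math.log(p_xy / (p_x * p_y))

# 'strong tea': rare words, but almost always together -> high PMI (toy numbers)
print(round(pmi(30, 200, 50, 1_000_000), 2))   # ~ 8.01
```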
ToPMine: Phrase LDA (Constrained Topic Modeling)
 The generative model for PhraseLDA is the same as
LDA
 Difference: the model incorporates constraints
obtained from the “bag-of-phrases” input
 The chain graph shows that all words in a phrase are
constrained to take on the same topic value
[knowledge discovery] using [least squares]
[support vector machine] [classifiers] …
Topic model inference with phrase constraints
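A minimal sketch of the constrained sampling step, assuming a collapsed-Gibbs setting: the whole phrase receives one topic draw, whose conditional multiplies in the likelihood of every word in the phrase. The count arrays and hyperparameters are schematic, not PhraseLDA's actual implementation.

```python
import numpy as np

def sample_phrase_topic(phrase_word_ids, d, n_dk, n_kw, n_k, alpha, beta, rng):
    """Resample one topic for an entire phrase: all of its words are
    constrained to share the topic, so every word's likelihood enters the
    conditional. (A real sampler would first decrement the phrase's
    current counts from n_dk, n_kw, n_k.)"""
    K, V = n_kw.shape
    log_p = np.log(n_dk[d] + alpha)                      # topic popularity in doc d
    for w in phrase_word_ids:                            # each word contributes
        log_p += np.log((n_kw[:, w] + beta) / (n_k + V * beta))
    p = np.exp(log_p - log_p.max())
    return rng.choice(K, p=p / p.sum())

rng = np.random.default_rng(1)
K, V = 3, 8
n_dk = np.ones((2, K)); n_kw = np.ones((K, V)); n_k = n_kw.sum(axis=1)
print(sample_phrase_topic([2, 5, 7], d=0, n_dk=n_dk, n_kw=n_kw, n_k=n_k,
                          alpha=0.1, beta=0.01, rng=rng))
```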
Example Topical Phrases: A Comparison
ToPMine [El-Kishky et al. 14], Strategy 3 (67 seconds):

| Topic 1 | Topic 2 |
| information retrieval | feature selection |
| social networks | machine learning |
| web search | semi supervised |
| search engine | large scale |
| information extraction | support vector machines |
| question answering | active learning |
| web pages | face recognition |
| : | : |

PDLDA [Lindsey et al. 12], Strategy 1 (3.72 hours):

| Topic 1 | Topic 2 |
| social networks | information retrieval |
| web search | text classification |
| time series | machine learning |
| search engine | support vector machines |
| management system | information extraction |
| real time | neural networks |
| decision trees | text categorization |
| : | : |
ToPMine: Experiments on DBLP Abstracts
ToPMine: Topics on Associate Press News (1989)
ToPMine: Experiments on Yelp Reviews
Session 5. A Comparative
Study of Three Strategies
Efficiency: Running Time of Different Strategies
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Running time: strategy 3 > strategy 2 > strategy 1 (“>” means outperforms, i.e., runs faster)
Coherence of Topics: Comparison of Strategies
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Coherence measured by z-score: strategy 3 > strategy 2 > strategy 1
Phrase Intrusion: Comparison of Strategies
Phrase intrusion measured by average number of correct answers:
strategy 3 > strategy 2 > strategy 1
Phrase Quality: Comparison of Strategies
Phrase quality measured by z-score:
strategy 3 > strategy 2 > strategy 1
Summary: Strategies on Topical Phrase Mining
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Integrated complex model; phrase quality and topic inference rely on each
other
 Slow and prone to overfitting
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Phrase quality relies on topic labels for unigrams
 Can be fast; generally high-quality topics and phrases
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the
bag-of-words model
 Topic inference relies on correct segmentation of documents, but is not sensitive to segmentation errors
 Can be fast; generally high-quality topics and phrases
Session 6. Mining Quality Phrases
from Massive Text Corpora: A
SegPhrase Approach
Mining Phrases: Why Not Use Raw Frequency Based
Methods?
 Traditional data-driven approaches
 Frequent pattern mining
 If AB is frequent, AB is likely to be a phrase
 Raw frequency could NOT reflect the quality of phrases
 E.g., freq(vector machine) ≥ freq(support vector machine)
 Need to rectify the frequency based on segmentation results
 Phrasal segmentation will tell
 Some words should be treated as a whole phrase whereas others are still
unigrams
SegPhrase: From Raw Corpus
to Quality Phrases and Segmented Corpus
Pipeline: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus

Example documents in the raw corpus:
Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.
Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.
Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
SegPhrase: The Overall Framework
 ClassPhrase: Frequent pattern mining, feature extraction, classification
 SegPhrase: Phrasal segmentation and phrase quality estimation
 SegPhrase+: One more round to enhance mined phrase quality
What Kind of Phrases Are of “High Quality”?
 Judging the quality of phrases
 Popularity
 “information retrieval” vs. “cross-language information retrieval”
 Concordance
 “powerful tea” vs. “strong tea”
 “active learning” vs. “learning classification”
 Informativeness
 “this paper” (frequent but not discriminative, not informative)
 Completeness
 “vector machine” vs. “support vector machine”
ClassPhrase I: Pattern Mining for Candidate Set
 Build a candidate phrase set by frequent pattern mining
 Mine frequent k-grams
 k is typically small, e.g., 6 in our experiments
 Popularity is measured by the raw frequencies of words and phrases mined from the corpus
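A minimal sketch of this candidate generation, assuming whitespace tokenization: count contiguous k-grams up to k = 6, with the Apriori-style shortcut that a k-gram is counted only if its (k−1)-gram prefix is already frequent (the corpus and min_support are toy values).

```python
from collections import Counter

def frequent_kgrams(corpus, max_k=6, min_support=10):
    """Apriori-style contiguous k-gram mining: a k-gram is counted only if
    its (k-1)-word prefix was already frequent, so counting stops early."""
    docs = [doc.split() for doc in corpus]
    frequent, prev = {}, None
    for k in range(1, max_k + 1):
        counts = Counter()
        for words in docs:
            for i in range(len(words) - k + 1):
                gram = tuple(words[i:i + k])
                if k == 1 or gram[:-1] in prev:    # prefix must be frequent
                    counts[gram] += 1
        prev = {g for g, c in counts.items() if c >= min_support}
        if not prev:
            break
        frequent.update({" ".join(g): counts[g] for g in prev})
    return frequent

corpus = ["support vector machine", "support vector machine classifiers"]
print(frequent_kgrams(corpus, min_support=2))
# includes 'support vector machine': 2 but not the infrequent 'classifiers'
```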
ClassPhrase II: Feature Extraction: Concordance
 Partition a phrase into two parts ⟨u_l, u_r⟩ to check whether their co-occurrence is
significantly higher than pure random
 E.g., “support vector | machine” (concordant) vs. “this paper | demonstrates” (not)
 Pointwise mutual information: PMI(⟨u_l, u_r⟩) = log [ p(v) / (p(u_l) p(u_r)) ]
 Pointwise KL divergence: PKL(v ‖ ⟨u_l, u_r⟩) = p(v) log [ p(v) / (p(u_l) p(u_r)) ]
 The additional factor p(v) multiplied into the pointwise mutual information
leads to less bias towards rare phrases
(Here v denotes the phrase and ⟨u_l, u_r⟩ its left/right partition.)
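A minimal sketch of the two features, assuming toy raw frequencies and taking the most conservative (lowest-PMI) binary split of the phrase; SegPhrase estimates the probabilities from corpus counts.

```python
import math

freq = {"support vector machine": 80, "support": 100, "vector": 120,
        "machine": 90, "support vector": 80, "vector machine": 5}
N = 10_000                                    # toy number of token positions

p = lambda s: freq.get(s, 1) / N              # crudely smoothed toy probability

def concordance_features(phrase):
    """Score the most conservative binary partition <u_l, u_r> of the phrase:
        PMI = log[ p(v) / (p(u_l) p(u_r)) ]
        PKL = p(v) * PMI   -- the extra p(v) damps rare phrases."""
    words = phrase.split()
    splits = [(" ".join(words[:i]), " ".join(words[i:]))
              for i in range(1, len(words))]
    ul, ur = min(splits, key=lambda s: p(phrase) / (p(s[0]) * p(s[1])))
    pmi = math.log(p(phrase) / (p(ul) * p(ur)))
    return pmi, p(phrase) * pmi

print(concordance_features("support vector machine"))
```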
ClassPhrase II: Feature Extraction: Informativeness
 Deriving Informativeness
 Quality phrases typically start and end with a non-stopword
 “machine learning is” vs. “machine learning”
 Use the average IDF over words in the phrase as a semantic signal
 A quality phrase is more likely to appear in quotes, brackets, or connected
by dashes (punctuation information)
 “state-of-the-art”
 We can also incorporate features using some NLP techniques, such as POS
tagging, chunking, and semantic parsing
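A minimal sketch of such features, with a toy stopword list and a hypothetical IDF table:

```python
STOPWORDS = {"the", "is", "of", "this", "a"}           # toy stopword list

def informativeness_features(phrase, idf):
    """Three cheap classifier features: whether the phrase starts/ends with
    a non-stopword, and the average IDF of its words (higher IDF means
    more topic-specific vocabulary)."""
    words = phrase.lower().split()
    return {
        "starts_non_stop": words[0] not in STOPWORDS,
        "ends_non_stop": words[-1] not in STOPWORDS,
        "avg_idf": sum(idf.get(w, 0.0) for w in words) / len(words),
    }

idf = {"machine": 3.1, "learning": 2.8, "is": 0.2}     # hypothetical IDF values
print(informativeness_features("machine learning is", idf))  # ends with stopword
print(informativeness_features("machine learning", idf))
```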
ClassPhrase III: Classifier
 Limited Training
 Labels: Whether a phrase is a quality one or not
 “support vector machine”: 1
 “the experiment shows”: 0
 For ~1GB corpus, only 300 labels
 Random Forest as our classifier
 Predicted phrase quality scores lie in [0, 1]
 Bootstrap many different datasets from limited labels
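A minimal sketch of this stage with scikit-learn, whose random forest bootstraps a dataset per tree, which is why it suits such a small label set. The feature vectors and labels below are simulated; real ones come from the concordance and informativeness features above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_labeled = rng.random((300, 5))                 # 300 labeled phrases, 5 features
y_labeled = (X_labeled[:, 0] > 0.5).astype(int)  # fake quality labels

# Each tree sees a bootstrap sample of the 300 labels
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_labeled, y_labeled)

X_candidates = rng.random((4, 5))                # unlabeled candidate phrases
quality = clf.predict_proba(X_candidates)[:, 1]  # quality scores in [0, 1]
print(quality)
```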
SegPhrase: Why Do We Need Phrasal
Segmentation in Corpus?
 Phrasal segmentation can tell which phrase is more appropriate
 Ex: A standard [feature vector] [machine learning] setup is used to describe...
 Rectified phrase frequency (expected influence)
 Example: an occurrence of “vector machine” spanning the two segments above is
not counted towards the rectified frequency
SegPhrase: Segmentation of Phrases
 Partition a sequence of words by maximizing the likelihood, considering:
 Phrase quality score
 ClassPhrase assigns a quality score for each phrase
 Probability in corpus
 Length penalty
 length penalty 𝛼: when 𝛼 > 1, it favors shorter phrases
 Filter out phrases with low rectified frequency
 Bad phrases are expected to rarely occur in the segmentation results
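A minimal sketch of this objective as Viterbi-style dynamic programming, assuming per-phrase quality scores Q from ClassPhrase and folding the length penalty into the log-score; the scores, penalty, and the floor score for unknown unigrams are toy choices.

```python
import math

Q = {"support vector machine": 0.9, "support vector": 0.7,
     "feature vector": 0.8, "machine learning": 0.9}   # toy quality scores

def segment(words, alpha=1.5, max_len=6):
    """best[i] = max log-score of segmenting words[:i]. Each segment scores
    log Q(phrase) - alpha * (len - 1); alpha > 1 favors shorter phrases."""
    n = len(words)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            phrase = " ".join(words[j:i])
            if i - j == 1:
                q = Q.get(phrase, 0.01)   # unknown single words get a floor score
            else:
                q = Q.get(phrase)
                if q is None:
                    continue              # unknown multi-word spans are not segments
            score = best[j] + math.log(q) - alpha * (i - j - 1)
            if score > best[i]:
                best[i], back[i] = score, j
    segs, i = [], n
    while i > 0:
        segs.append(" ".join(words[back[i]:i]))
        i = back[i]
    return segs[::-1]

print(segment("a standard feature vector machine learning setup".split()))
# ['a', 'standard', 'feature vector', 'machine learning', 'setup']
```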
SegPhrase+: Enhancing Phrasal Segmentation
 SegPhrase+: One more round for enhanced phrasal segmentation
 Feedback
 Using rectified frequencies, re-compute the features previously computed
from raw frequencies
 Process
 Classification → Phrasal segmentation // SegPhrase
 → Classification → Phrasal segmentation // SegPhrase+
 Effects on computing quality scores
 “np hard in the strong sense” vs. “np hard in the strong”
 “data base management system”
Performance Study: Methods to Be Compared
 Phrase mining methods to be compared
 NLP chunking based methods
 Chunks as candidates
 Sorted by TF-IDF and C-value (K. Frantzi et al., 2000)
 Unsupervised raw frequency based methods
 ConExtr (A. Parameswaran et al., VLDB 2010)
 ToPMine (A. El-Kishky et al., VLDB 2015)
 Supervised method
 KEA, designed for single document keyphrases (O. Medelyan & I. H. Witten,
2006)
Performance Study: Experimental Setting
 Datasets
 Popular Wiki Phrases
 Based on internal links
 ~7K high quality phrases
 Pooling
 Sampled 500 * 7 Wiki-uncovered phrases
 Evaluated by 3 reviewers independently
| Dataset | #docs | #words | #labels |
| DBLP | 2.77M | 91.6M | 300 |
| Yelp | 4.75M | 145.1M | 300 |
Performance: Precision Recall Curves on DBLP
(Figure: precision-recall curves on DBLP, comparing SegPhrase+ with other baselines — TF-IDF, C-Value, ConExtr, KEA, ToPMine — and with our three variations: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+.)
Performance Study: Processing Efficiency
 SegPhrase+ is linear to the size of corpus!
Experimental Results: Interesting Phrases
Generated (From the Titles and Abstracts of SIGMOD)
Query: SIGMOD

| Rank | SegPhrase+ | Chunking (TF-IDF & C-Value) |
| 1 | data base | data base |
| 2 | database system | database system |
| 3 | relational database | query processing |
| 4 | query optimization | query optimization |
| 5 | query processing | relational database |
| … | … | … |
| 51 | sql server | database technology |
| 52 | relational data | database server |
| 53 | data structure | large volume |
| 54 | join query | performance study |
| 55 | web service | web service |
| … | … | … |
| 201 | high dimensional data | efficient implementation |
| 202 | location based service | sensor network |
| 203 | xml schema | large collection |
| 204 | two phase locking | important issue |
| 205 | deep web | frequent itemset |
| … | … | … |

(In the original slide, phrases found only by SegPhrase+ and only by Chunking are highlighted.)
Experimental Results: Interesting Phrases
Generated (From the Titles and Abstracts of SIGKDD)
Query: SIGKDD

| Rank | SegPhrase+ | Chunking (TF-IDF & C-Value) |
| 1 | data mining | data mining |
| 2 | data set | association rule |
| 3 | association rule | knowledge discovery |
| 4 | knowledge discovery | frequent itemset |
| 5 | time series | decision tree |
| … | … | … |
| 51 | association rule mining | search space |
| 52 | rule set | domain knowledge |
| 53 | concept drift | important problem |
| 54 | knowledge acquisition | concurrency control |
| 55 | gene expression data | conceptual graph |
| … | … | … |
| 201 | web content | optimal solution |
| 202 | frequent subgraph | semantic relationship |
| 203 | intrusion detection | effective way |
| 204 | categorical attribute | space complexity |
| 205 | user preference | small set |
| … | … | … |

(In the original slide, phrases found only by SegPhrase+ and only by Chunking are highlighted.)
Experimental Results: Similarity Search
 Find high-quality similar phrases based on a user’s phrase query: SegPhrase+
generates high-quality, semantically similar phrases in response
 In DBLP, query on “data mining” and “OLAP”
 In Yelp, query on “blu-ray”, “noodle”, and “valet parking”
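The slide does not spell out the similarity mechanism; one common realization is cosine similarity over phrase vectors learned from the segmented corpus, sketched here with made-up vectors:

```python
import numpy as np

# Hypothetical phrase vectors (e.g., from word2vec trained on the segmented corpus)
vectors = {
    "data mining":         np.array([0.9, 0.1, 0.0]),
    "knowledge discovery": np.array([0.8, 0.2, 0.1]),
    "olap":                np.array([0.1, 0.9, 0.2]),
    "noodle":              np.array([0.0, 0.1, 0.9]),
}

def most_similar(query, k=2):
    """Rank the other phrases by cosine similarity to the query's vector."""
    q = vectors[query]
    sims = {p: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
            for p, v in vectors.items() if p != query}
    return sorted(sims.items(), key=lambda kv: -kv[1])[:k]

print(most_similar("data mining"))   # 'knowledge discovery' should rank first
```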
Experimental Results: High Quality Phrases
Generated (Top-Ranked Phrases From English Gigaword)
 Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin
 Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna
 Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis
 Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh
 Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler
 Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar
 Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen
 Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova
 Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku
 Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma,
 Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo
 Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate
Recent Progress
 Distant Training: No need of human labeling
 Training using general knowledge bases
 E.g., Freebase, Wikipedia
 Quality Estimation for Unigrams
 Integration of phrases and unigrams in one uniform framework
 Demo based on DBLP abstracts
 Multi-languages: Beyond English corpus
 Extensible to mining quality phrases in multiple languages
 Recent progress: SegPhrase+ works on Chinese and Arabic
Mining Quality Phrases in Multiple Languages
 Both ToPMine and SegPhrase+ are extensible to mining quality phrases in
multiple languages
 SegPhrase+ on Chinese (from Chinese Wikipedia):

| Rank | Phrase | In English |
| … | … | … |
| 62 | 首席_执行官 | CEO |
| 63 | 中间_偏右 | Centre-right |
| … | … | … |
| 84 | 百度_百科 | Baidu Baike |
| 85 | 热带_气旋 | Tropical cyclone |
| 86 | 中国科学院_院士 | Fellow of the Chinese Academy of Sciences |
| … | … | … |
| 1001 | 十大_中文_金曲 | Top-10 Chinese Songs |
| 1002 | 全球_资讯网 | Global News Website |
| 1003 | 天一阁_藏_明代_科举_录_选刊 | A Chinese book title |
| … | … | … |
| 9934 | 国家_戏剧_院 | National Theater |
| 9935 | 谢谢_你 | Thank you |
| … | … | … |

 ToPMine on Arabic (from the Quran, Fus7a Arabic, no preprocessing); example phrases:
 كفروا: “those who disbelieve”
 بسم الله الرحمن الرحيم: “In the name of God, the Gracious and Merciful”
Summary on SegPhrase+ and Future Work
 SegPhrase+: A new phrase mining framework
 Integrating phrase mining with phrasal segmentation
 Requires only limited training or distant training
 Generates high-quality phrases, close to human judgement
 Linearly scalable in time and space
 Looking forward: High-quality, scalable phrase mining
 Facilitate entity recognition and typing in large corpora
 (See the next part of this tutorial)
 Transform massive unstructured data into semi-structured knowledge
networks
References
 D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions, arXiv:0907.1013, 2009
 K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition:
Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
 M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical
Keyphrases on Collections of Short Documents. SDM’14
 A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
 K. Frantzi, S. Ananiadou, and H. Mima, Automatic Recognition of Multi-Word Terms: the c-value/nc-value
Method. Int. Journal on Digital Libraries, 3(2), 2000
 R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A phrase-discovering topic model using hierarchical pitman-yor
processes, EMNLP-CoNLL’12.
 J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
 O. Medelyan and I. H. Witten, Thesaurus Based Automatic Keyphrase Indexing. IJCDL’06
 Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models, KDD’07
 A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from
Large Datasets. VLDB’10
 X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information
retrieval, ICDM’07

More Related Content

Similar to 2015 07-tuto1-phrase mining

Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
Techniques For Deep Query Understanding
Techniques For Deep Query UnderstandingTechniques For Deep Query Understanding
Techniques For Deep Query Understanding
Abhay Prakash
 
ICMT 2016: Search-Based Model Transformations with MOMoT
ICMT 2016: Search-Based Model Transformations with MOMoTICMT 2016: Search-Based Model Transformations with MOMoT
ICMT 2016: Search-Based Model Transformations with MOMoT
Martin Fleck
 
Introduction to Machine Learning.
Introduction to Machine Learning.Introduction to Machine Learning.
Introduction to Machine Learning.butest
 
Final presentation
Final presentationFinal presentation
Final presentation
Nitish Upreti
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...
Sebastian Ruder
 
text classification_NB.ppt
text classification_NB.ppttext classification_NB.ppt
text classification_NB.ppt
RithikRaj25
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
Bhaskar Mitra
 
Topic modeling of marketing scientific papers: An experimental survey
Topic modeling of marketing scientific papers: An experimental surveyTopic modeling of marketing scientific papers: An experimental survey
Topic modeling of marketing scientific papers: An experimental survey
ICDEcCnferenece
 
E017252831
E017252831E017252831
E017252831
IOSR Journals
 
Extraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity MiningExtraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity Mining
iosrjce
 
Big Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingBig Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic Processing
Na'im Tyson
 
lecture14-learning-ranking.ppt
lecture14-learning-ranking.pptlecture14-learning-ranking.ppt
lecture14-learning-ranking.ppt
VishalKumar725248
 
lecture14-learning-ranking.ppt
lecture14-learning-ranking.pptlecture14-learning-ranking.ppt
lecture14-learning-ranking.ppt
VishalKumar725248
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
Paul Lo
 
Question Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical featuresQuestion Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical features
IJwest
 
Question Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical featuresQuestion Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical features
dannyijwest
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine Reading
Isabelle Augenstein
 
Resume.doc
Resume.docResume.doc
Resume.docbutest
 

Similar to 2015 07-tuto1-phrase mining (20)

Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Text
 
Techniques For Deep Query Understanding
Techniques For Deep Query UnderstandingTechniques For Deep Query Understanding
Techniques For Deep Query Understanding
 
ICMT 2016: Search-Based Model Transformations with MOMoT
ICMT 2016: Search-Based Model Transformations with MOMoTICMT 2016: Search-Based Model Transformations with MOMoT
ICMT 2016: Search-Based Model Transformations with MOMoT
 
Introduction to Machine Learning.
Introduction to Machine Learning.Introduction to Machine Learning.
Introduction to Machine Learning.
 
Final presentation
Final presentationFinal presentation
Final presentation
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...
 
text classification_NB.ppt
text classification_NB.ppttext classification_NB.ppt
text classification_NB.ppt
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Topic modeling of marketing scientific papers: An experimental survey
Topic modeling of marketing scientific papers: An experimental surveyTopic modeling of marketing scientific papers: An experimental survey
Topic modeling of marketing scientific papers: An experimental survey
 
E017252831
E017252831E017252831
E017252831
 
Extraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity MiningExtraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity Mining
 
Big Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingBig Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic Processing
 
lecture14-learning-ranking.ppt
lecture14-learning-ranking.pptlecture14-learning-ranking.ppt
lecture14-learning-ranking.ppt
 
lecture14-learning-ranking.ppt
lecture14-learning-ranking.pptlecture14-learning-ranking.ppt
lecture14-learning-ranking.ppt
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
 
Question Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical featuresQuestion Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical features
 
Question Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical featuresQuestion Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical features
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine Reading
 
Resume.doc
Resume.docResume.doc
Resume.doc
 

More from jins0618

Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud EnvironmentMachine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
jins0618
 
Latent Interest and Topic Mining on User-item Bipartite Networks
Latent Interest and Topic Mining on User-item Bipartite NetworksLatent Interest and Topic Mining on User-item Bipartite Networks
Latent Interest and Topic Mining on User-item Bipartite Networks
jins0618
 
Web Service QoS Prediction Approach in Mobile Internet Environments
Web Service QoS Prediction Approach in Mobile Internet EnvironmentsWeb Service QoS Prediction Approach in Mobile Internet Environments
Web Service QoS Prediction Approach in Mobile Internet Environments
jins0618
 
吕潇 星环科技大数据技术探索与应用实践
吕潇 星环科技大数据技术探索与应用实践吕潇 星环科技大数据技术探索与应用实践
吕潇 星环科技大数据技术探索与应用实践
jins0618
 
李战怀 大数据环境下数据存储与管理的研究
李战怀 大数据环境下数据存储与管理的研究李战怀 大数据环境下数据存储与管理的研究
李战怀 大数据环境下数据存储与管理的研究
jins0618
 
2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline
jins0618
 
Christian jensen advanced routing in spatial networks using big data
Christian jensen advanced routing in spatial networks using big dataChristian jensen advanced routing in spatial networks using big data
Christian jensen advanced routing in spatial networks using big data
jins0618
 
Jeffrey xu yu large graph processing
Jeffrey xu yu large graph processingJeffrey xu yu large graph processing
Jeffrey xu yu large graph processing
jins0618
 
Calton pu experimental methods on performance in cloud and accuracy in big da...
Calton pu experimental methods on performance in cloud and accuracy in big da...Calton pu experimental methods on performance in cloud and accuracy in big da...
Calton pu experimental methods on performance in cloud and accuracy in big da...
jins0618
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
jins0618
 
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processing
jins0618
 
Wang ke mining revenue-maximizing bundling configuration
Wang ke mining revenue-maximizing bundling configurationWang ke mining revenue-maximizing bundling configuration
Wang ke mining revenue-maximizing bundling configuration
jins0618
 
Wang ke classification by cut clearance under threshold
Wang ke classification by cut clearance under thresholdWang ke classification by cut clearance under threshold
Wang ke classification by cut clearance under threshold
jins0618
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus type
jins0618
 
2015 07-tuto3-mining hin
2015 07-tuto3-mining hin2015 07-tuto3-mining hin
2015 07-tuto3-mining hin
jins0618
 
2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline
jins0618
 
Weiyi meng web data truthfulness analysis
Weiyi meng web data truthfulness analysisWeiyi meng web data truthfulness analysis
Weiyi meng web data truthfulness analysis
jins0618
 
Ke yi small summaries for big data
Ke yi small summaries for big dataKe yi small summaries for big data
Ke yi small summaries for big data
jins0618
 
Gao cong geospatial social media data management and context-aware recommenda...
Gao cong geospatial social media data management and context-aware recommenda...Gao cong geospatial social media data management and context-aware recommenda...
Gao cong geospatial social media data management and context-aware recommenda...
jins0618
 
Chengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big dataChengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big data
jins0618
 

More from jins0618 (20)

Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud EnvironmentMachine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
 
Latent Interest and Topic Mining on User-item Bipartite Networks
Latent Interest and Topic Mining on User-item Bipartite NetworksLatent Interest and Topic Mining on User-item Bipartite Networks
Latent Interest and Topic Mining on User-item Bipartite Networks
 
Web Service QoS Prediction Approach in Mobile Internet Environments
Web Service QoS Prediction Approach in Mobile Internet EnvironmentsWeb Service QoS Prediction Approach in Mobile Internet Environments
Web Service QoS Prediction Approach in Mobile Internet Environments
 
吕潇 星环科技大数据技术探索与应用实践
吕潇 星环科技大数据技术探索与应用实践吕潇 星环科技大数据技术探索与应用实践
吕潇 星环科技大数据技术探索与应用实践
 
李战怀 大数据环境下数据存储与管理的研究
李战怀 大数据环境下数据存储与管理的研究李战怀 大数据环境下数据存储与管理的研究
李战怀 大数据环境下数据存储与管理的研究
 
2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline
 
Christian jensen advanced routing in spatial networks using big data
Christian jensen advanced routing in spatial networks using big dataChristian jensen advanced routing in spatial networks using big data
Christian jensen advanced routing in spatial networks using big data
 
Jeffrey xu yu large graph processing
Jeffrey xu yu large graph processingJeffrey xu yu large graph processing
Jeffrey xu yu large graph processing
 
Calton pu experimental methods on performance in cloud and accuracy in big da...
Calton pu experimental methods on performance in cloud and accuracy in big da...Calton pu experimental methods on performance in cloud and accuracy in big da...
Calton pu experimental methods on performance in cloud and accuracy in big da...
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processing
 
Wang ke mining revenue-maximizing bundling configuration
Wang ke mining revenue-maximizing bundling configurationWang ke mining revenue-maximizing bundling configuration
Wang ke mining revenue-maximizing bundling configuration
 
Wang ke classification by cut clearance under threshold
Wang ke classification by cut clearance under thresholdWang ke classification by cut clearance under threshold
Wang ke classification by cut clearance under threshold
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus type
 
2015 07-tuto3-mining hin
2015 07-tuto3-mining hin2015 07-tuto3-mining hin
2015 07-tuto3-mining hin
 
2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline
 
Weiyi meng web data truthfulness analysis
Weiyi meng web data truthfulness analysisWeiyi meng web data truthfulness analysis
Weiyi meng web data truthfulness analysis
 
Ke yi small summaries for big data
Ke yi small summaries for big dataKe yi small summaries for big data
Ke yi small summaries for big data
 
Gao cong geospatial social media data management and context-aware recommenda...
Gao cong geospatial social media data management and context-aware recommenda...Gao cong geospatial social media data management and context-aware recommenda...
Gao cong geospatial social media data management and context-aware recommenda...
 
Chengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big dataChengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big data
 

Recently uploaded

Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 

Recently uploaded (20)

Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 

2015 07-tuto1-phrase mining

  • 1. 1
  • 2. Phrase Mining and Topic Modeling from Large Corpora JIAWEI HAN COMPUTER SCIENCE UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN JUNE 16, 2016 2
  • 3. 3 Why Phrase Mining?  Unigrams vs. phrases  Unigrams (single words) are ambiguous Example: “United”: United States? United Airline? United Parcel Service?  Phrase: A natural, meaningful, unambiguous semantic unit Example: “United States” vs. “United Airline”  Mining semantically meaningful phrases  Transform text data from word granularity to phrase granularity  Enhance the power and efficiency at manipulating unstructured data using database technology
  • 4. 4  Phrase mining: Originated from the NLP community—“Chunking”  Model it as a sequence labeling problem (B-NP, I-NP, O, …)  Need annotation and training  Annotate hundreds of documents as training data  Train a supervised model based on part-of-speech features  Recent trend:  Use distributional features based on web n-grams (Bergsma et al., 2010)  State-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score  Limitations  High annotation cost, not scalable to a new language, a new domain or genre  May not fit domain-specific, dynamic, emerging applications  Scientific domains, query logs, or social media, e.g., Yelp, Twitter Mining Phrases: Why Not Use NLP Methods?
  • 5. 5 Data Mining Approaches  General principle: Fully exploit information redundancy and data- driven criteria to determine phrase boundaries and salience  Phrase Mining and Topic Modeling from Large Corpora  Strategy 1: Simultaneously Inferring Phrases and Topics  Strategy 2: Post Topic Modeling Phrase Construction  Strategy 3: First Phrase Mining then Topic Modeling (ToPMine)  Integration of Phrase Mining with Document Segmentation
  • 6. Session 1. Frequent Pattern Mining for Text Data
  • 7. 7 Frequent Pattern Mining for Text Data: Phrase Mining and Topic Modeling  Motivation: Unigrams (single words) can be difficult to interpret  Ex.: The topic that represents the area of Machine Learning learning reinforcement support machine vector selection feature random : versus learning support vector machines reinforcement learning feature selection conditional random fields classification decision trees :
  • 8. 8 Various Strategies: Phrase-Based Topic Modeling  Strategy 1: Generate bag-of-words → generate sequence of tokens  Bigram topical model [Wallach’06], topical n-gram model [Wang, et al.’07], phrase discovering topic model [Lindsey, et al.’12]  Strategy 2: Post bag-of-words model inference, visualize topics with n-grams  Label topic [Mei et al.’07], TurboTopic [Blei & Lafferty’09], KERT [Danilevsky, et al.’14]  Strategy 3: Prior bag-of-words model inference, mine phrases and impose on the bag-of-words model  ToPMine [El-kishky, et al.’15]
  • 9. Session 2. Strategy 1: Simultaneously Inferring Phrases and Topics
  • 10. 10 Strategy 1: Simultaneously Inferring Phrases and Topics  Bigram Topic Model [Wallach’06]  Probabilistic generative model that conditions on previous word and topic when drawing next word  Topical N-Grams (TNG) [Wang, et al.’07]  Probabilistic model that generates words in textual order  Create n-grams by concatenating successive bigrams (a generalization of Bigram Topic Model)  Phrase-Discovering LDA (PDLDA) [Lindsey, et al.’12]  Viewing each sentence as a time-series of words, PDLDA posits that the generative parameter (topic) changes periodically  Each word is drawn based on previous m words (context) and current phrase topic  High model complexity: Tends to overfitting; High inference cost: Slow
  • 11. 11 TNG: Experiments on Research Papers
  • 12. 12 TNG: Experiments on Research Papers (2)
  • 13. Session 3. Strategy 2: Post Topic Modeling Phrase Construction
  • 14. 14 Strategy 2: Post Topic Modeling Phrase Construction  TurboTopics [Blei & Lafferty’09] – Phrase construction as a post-processing step to Latent Dirichlet Allocation  Perform Latent Dirichlet Allocation on corpus to assign each token a topic label  Merge adjacent unigrams with the same topic label by a distribution-free permutation test on arbitrary-length back-off model  End recursive merging when all significant adjacent unigrams have been merged  KERT [Danilevsky et al.’14] – Phrase construction as a post-processing step to Latent Dirichlet Allocation  Perform frequent pattern mining on each topic  Perform phrase ranking based on four different criteria
  • 15. 15 Example of TurboTopics  Perform LDA on corpus to assign each token a topic label  E.g., … phase11 transition11 …. game153 theory127 …  Then merge adjacent unigrams with same topic label
  • 16. 16 KERT: Topical Keyphrase Extraction & Ranking [Danilevsky, et al. 2014] learning support vector machines reinforcement learning feature selection conditional random fields classification decision trees : Topical keyphrase extraction & ranking knowledge discovery using least squares support vector machine classifiers support vectors for reinforcement learning a hybrid approach to feature selection pseudo conditional random fields automatic web page classification in a dynamic and hierarchical way inverse time dependency in convex regularized learning postprocessing decision trees to extract actionable knowledge variance minimization least squares support vector machines … Unigram topic assignment: Topic 1 & Topic 2
  • 17. 17 Framework of KERT 1. Run bag-of-words model inference and assign topic label to each token 2. Extract candidate keyphrases within each topic 3. Rank the keyphrases in each topic  Popularity: ‘information retrieval’ vs. ‘cross-language information retrieval’  Discriminativeness: only frequent in documents about topic t  Concordance: ‘active learning’ vs.‘learning classification’  Completeness: ‘vector machine’ vs. ‘support vector machine’ Frequent pattern mining Comparability property: directly compare phrases of mixed lengths
  • 18. 18 Comparison of phrase ranking methods The topic that represents the area of Machine Learning kpRel [Zhao et al. 11] KERT (-popularity) KERT (-discriminativeness) KERT (-concordance) KERT [Danilevsky et al. 14] learning effective support vector machines learning learning classification text feature selection classification support vector machines selection probabilistic reinforcement learning selection reinforcement learning models identification conditional random fields feature feature selection algorithm mapping constraint satisfaction decision conditional random fields features task decision trees bayesian classification decision planning dimensionality reduction trees decision trees : : : : : KERT: Topical Phrases on Machine Learning Top-Ranked Phrases by Mining Paper Titles in DBLP
  • 19. Session 4. Strategy 3: First Phrase Mining then Topic Modeling
  • 20. 20 Strategy 3: First Phrase Mining then Topic Modeling  ToPMine [El-Kishky et al. VLDB’15]  First phrase construction, then topic mining  Contrast with KERT: topic modeling, then phrase mining  The ToPMine Framework:  Perform frequent contiguous pattern mining to extract candidate phrases and their counts  Perform agglomerative merging of adjacent unigrams as guided by a significance score—This segments each document into a “bag-of-phrases”  The newly formed bag-of-phrases are passed as input to PhraseLDA, an extension of LDA, that constrains all words in a phrase to each sharing the same latent topic
  • 21. 21 Why First Phrase Mining then Topic Modeling ?  With Strategy 2, tokens in the same phrase may be assigned to different topics  Ex. knowledge discovery using least squares support vector machine classifiers…  Knowledge discovery and support vector machine should have coherent topic labels  Solution: switch the order of phrase mining and topic model inference  Techniques  Phrase mining and document segmentation  Topic model inference with phrase constraint [knowledge discovery] using [least squares] [support vector machine] [classifiers] … [knowledge discovery] using [least squares] [support vector machine] [classifiers] …
  • 22. 22 Phrase Mining: Frequent Pattern Mining + Statistical Analysis [Markov blanket] [feature selection] for [support vector machines] [knowledge discovery] using [least squares] [support vector machine] [classifiers] …[support vector] for [machine learning]… Phrase Raw freq. True freq. [support vector machine] 90 80 [vector machine] 95 0 [support vector] 100 20 Quality phrases Based on significance score [Church et al.’91]: α(P1, P2) ≈ (f(P1●P2) ̶ µ0(P1,P2))/√ f(P1●P2)
  • 23. 23 Collocation Mining  Collocation: A sequence of words that occur more frequently than expected  Often “interesting” and due to their non-compositionality, often relay information not portrayed by their constituent terms (e.g., “made an exception”, “strong tea”)  Many different measures used to extract collocations from a corpus [Dunning 93, Pederson 96]  E.g., mutual information, t-test, z-test, chi-squared test, likelihood ratio  Many of these measures can be used to guide the agglomerative phrase- segmentation algorithm
  • 24. 24 ToPMine: Phrase LDA (Constrained Topic Modeling)  The generative model for PhraseLDA is the same as LDA  Difference: the model incorporates constraints obtained from the “bag-of-phrases” input  Chain-graph shows that all words in a phrase are constrained to take on the same topic values [knowledge discovery] using [least squares] [support vector machine] [classifiers] … Topic model inference with phrase constraints
  • 25. 25 Example Topical Phrases: A Comparison ToPMine [El-kishky et al. 14] Strategy 3 (67 seconds) information retrieval feature selection social networks machine learning web search semi supervised search engine large scale information extraction support vector machines question answering active learning web pages face recognition : : Topic 1 Topic 2 social networks information retrieval web search text classification time series machine learning search engine support vector machines management system information extraction real time neural networks decision trees text categorization : : Topic 1 Topic 2 PDLDA [Lindsey et al. 12] Strategy 1 (3.72 hours)
  • 26. 26 ToPMine: Experiments on DBLP Abstracts
  • 27. 27 ToPMine: Topics on Associate Press News (1989)
  • 29. Session 5. A Comparative Study of Three Strategies
  • 30. 30 Efficiency: Running Time of Different Strategies  Strategy 1: Generate bag-of-words → generate sequence of tokens  Strategy 2: Post bag-of-words model inference, visualize topics with n-grams  Strategy 3: Prior bag-of-words model inference, mine phrases and impose to the bag-of-words model Running time: strategy 3 > strategy 2 > strategy 1 (“>” means outperforms)
  • 31. 31 Coherence of Topics: Comparison of Strategies  Strategy 1: Generate bag-of-words → generate sequence of tokens  Strategy 2: Post bag-of-words model inference, visualize topics with n-grams  Strategy 3: Prior bag-of-words model inference, mine phrases and impose to the bag-of-words model Coherence measured by z-score: strategy 3 > strategy 2 > strategy 1
  • 32. 32 Phrase Intrusion: Comparison of Strategies Phrase intrusion measured by average number of correct answers: strategy 3 > strategy 2 > strategy 1
  • 33. 33 Phrase Quality: Comparison of Strategies Phrase quality measured by z-score: strategy 3 > strategy 2 > strategy 1
  • 34. 34 Summary: Strategies on Topical Phrase Mining  Strategy 1: Generate bag-of-words → generate sequence of tokens  Integrated complex model; phrase quality and topic inference rely on each other  Slow and overfitting  Strategy 2: Post bag-of-words model inference, visualize topics with n-grams  Phrase quality relies on topic labels for unigrams  Can be fast; generally high-quality topics and phrases  Strategy 3: Prior bag-of-words model inference, mine phrases and impose to the bag-of-words model  Topic inference relies on correct segmentation of documents, but not sensitive  Can be fast; generally high-quality topics and phrases
  • 36. Session 6. Mining Quality Phrases from Massive Text Corpora: A SegPhrase Approach
  • 37. 37 Mining Phrases: Why Not Use Raw Frequency Based Methods?  Traditional data-driven approaches  Frequent pattern mining: if AB is frequent, AB is likely to be a phrase  But raw frequency does NOT reflect the quality of phrases  E.g., freq(vector machine) ≥ freq(support vector machine), yet “vector machine” is not a meaningful phrase  Need to rectify the frequency based on segmentation results  Phrasal segmentation will tell: some words should be treated as a whole phrase whereas others remain unigrams
  • 38. 38 SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus
Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.
Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.
Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
Pipeline: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus
  • 39. 39 SegPhrase: The Overall Framework  ClassPhrase: Frequent pattern mining, feature extraction, classification  SegPhrase: Phrasal segmentation and phrase quality estimation  SegPhrase+: One more round to enhance mined phrase quality
  • 40. 40 What Kind of Phrases Are of “High Quality”?  Judging the quality of phrases  Popularity  “information retrieval” vs. “cross-language information retrieval”  Concordance  “powerful tea” vs. “strong tea”  “active learning” vs. “learning classification”  Informativeness  “this paper” (frequent but not discriminative, not informative)  Completeness  “vector machine” vs. “support vector machine”
  • 41. 41 ClassPhrase I: Pattern Mining for Candidate Set  Build a candidate phrase set by frequent pattern mining  Mining frequent k-grams  k is typically small, e.g., 6 in our experiments  Popularity measured by the raw frequency of words and phrases mined from the corpus
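A minimal sketch of this candidate-generation step; the frequency threshold and the toy documents are illustrative, not the experiments' settings.

```python
from collections import Counter

def frequent_ngrams(docs, max_k=6, min_count=2):
    """Candidate generation: count every contiguous k-gram (k <= max_k)
    and keep those above a raw-frequency threshold."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        for k in range(1, max_k + 1):
            for i in range(len(words) - k + 1):
                counts[" ".join(words[i:i + k])] += 1
    return {g: c for g, c in counts.items() if c >= min_count}

docs = ["support vector machine classifiers", "support vector machine learning"]
print(frequent_ngrams(docs))   # "support vector machine" survives the threshold
```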
  • 42. 42 ClassPhrase II: Feature Extraction: Concordance  Partition a phrase v into two parts ⟨u_l, u_r⟩ to check whether their co-occurrence is significantly higher than pure random  “support vector machine” vs. “this paper demonstrates”  Pointwise mutual information: PMI(⟨u_l, u_r⟩) = log [ p(v) / (p(u_l) p(u_r)) ]  Pointwise KL divergence: PKL(v ∥ ⟨u_l, u_r⟩) = p(v) log [ p(v) / (p(u_l) p(u_r)) ]  The additional factor p(v) multiplied into the pointwise mutual information reduces the bias towards rarely occurring phrases
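A sketch of computing these two features over a phrase's binary partitions. Taking the weakest split (the minimum over partitions) as the conservative estimate is an assumption here, and the phrase probabilities below are hypothetical.

```python
import math

def concordance_features(phrase, prob):
    """PMI and pointwise-KL features over all binary partitions <u_l, u_r>
    of a phrase; the minimum over partitions (assumed here) gives a
    conservative concordance estimate."""
    words = phrase.split()
    p_v = prob[phrase]
    pmi_vals, pkl_vals = [], []
    for i in range(1, len(words)):
        u_l, u_r = " ".join(words[:i]), " ".join(words[i:])
        ratio = p_v / (prob[u_l] * prob[u_r])
        pmi_vals.append(math.log(ratio))
        pkl_vals.append(p_v * math.log(ratio))   # PKL = p(v) * PMI
    return min(pmi_vals), min(pkl_vals)

# Hypothetical phrase probabilities estimated from a corpus
prob = {"support vector machine": 1e-5, "support": 1e-3, "machine": 2e-3,
        "vector machine": 2e-5, "support vector": 5e-5}
print(concordance_features("support vector machine", prob))
```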
  • 43. 43 ClassPhrase II: Feature Extraction: Informativeness  Deriving informativeness  Quality phrases typically start and end with a non-stopword  “machine learning is” vs. “machine learning”  Use the average IDF over words in the phrase to measure how much meaning the words carry  Usually, the probability that a quality phrase appears in quotes, brackets, or connected by dashes is higher (punctuation information)  “state-of-the-art”  We can also incorporate features from NLP techniques, such as POS tagging, chunking, and semantic parsing
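A small sketch of the first two signals; the stopword list and IDF values are toy stand-ins, and the feature names are hypothetical.

```python
STOPWORDS = {"is", "the", "of", "a", "this"}   # toy stopword list

def informativeness_features(phrase, idf):
    """Two informativeness signals from the slide: quality phrases rarely
    start or end with a stopword, and their words tend to have high
    average IDF."""
    words = phrase.lower().split()
    boundaries_ok = words[0] not in STOPWORDS and words[-1] not in STOPWORDS
    avg_idf = sum(idf.get(w, 0.0) for w in words) / len(words)
    return {"non_stopword_boundaries": boundaries_ok, "avg_idf": avg_idf}

idf = {"machine": 3.2, "learning": 2.9, "is": 0.1}   # hypothetical IDF values
print(informativeness_features("machine learning is", idf))   # ends with stopword
print(informativeness_features("machine learning", idf))      # clean boundaries
```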
  • 44. 44 ClassPhrase III: Classifier  Limited Training  Labels: Whether a phrase is a quality one or not  “support vector machine”: 1  “the experiment shows”: 0  For ~1GB corpus, only 300 labels  Random Forest as our classifier  Predicted phrase quality scores lie in [0, 1]  Bootstrap many different datasets from limited labels
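A hedged sketch of this step using scikit-learn's RandomForestClassifier: each tree trains on a bootstrap sample of the small labeled set, and the fraction of trees voting "quality" acts as a score in [0, 1]. The feature rows and labels below are toy stand-ins for the ~300 labeled phrases.

```python
from sklearn.ensemble import RandomForestClassifier

# One feature row per candidate phrase (e.g., concordance, avg IDF, boundary
# flag); values and labels are invented for illustration.
X = [[6.1, 3.2, 1],   # "support vector machine" -> quality (1)
     [5.5, 2.9, 1],
     [0.3, 0.4, 0],   # "the experiment shows"   -> not quality (0)
     [0.8, 0.1, 0]]
y = [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.predict_proba([[5.8, 3.0, 1]])[:, 1])   # predicted quality score
```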
  • 45. 45 SegPhrase: Why Do We Need Phrasal Segmentation in Corpus?  Phrasal segmentation can tell which phrase is more appropriate  Ex: A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe...  Rectified phrase frequency (expected influence)  Example: “vector machine” is not counted towards the rectified frequency, because the segmentation never brackets it as a phrase
  • 46. 46 SegPhrase: Segmentation of Phrases  Partition a sequence of words by maximizing the likelihood  Considering  Phrase quality score  ClassPhrase assigns a quality score to each phrase  Probability in corpus  Length penalty  Length penalty 𝛼: when 𝛼 > 1, it favors shorter phrases  Filter out phrases with low rectified frequency  Bad phrases are expected to rarely occur in the segmentation results
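To make the objective concrete, here is a minimal Viterbi-style dynamic program. The exact scoring and penalty form in SegPhrase differ, so treat this as a sketch under simplified assumptions: toy probabilities, and a penalty that grows with segment length so that α > 1 favors splitting into shorter phrases.

```python
import math

def segment(words, quality, prob, alpha=1.1, max_len=6):
    """Sketch of likelihood-maximizing phrasal segmentation: each segment
    scores log(prob * quality) minus a length penalty; unigrams pass
    through with quality 1, unknown multi-word segments are rejected."""
    n = len(words)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            seg = " ".join(words[i:j])
            q = quality.get(seg, 1.0 if j - i == 1 else 0.0)
            if q == 0.0:
                continue
            s = (best[i] + math.log(prob.get(seg, 1e-9) * q)
                 - (j - i - 1) * math.log(alpha))   # penalize long segments
            if s > best[j]:
                best[j], back[j] = s, i
    segments, j = [], n
    while j > 0:                                    # reconstruct best split
        segments.append(" ".join(words[back[j]:j]))
        j = back[j]
    return segments[::-1]

quality = {"support vector machine": 0.95, "support vector": 0.8}
prob = {"support vector machine": 1e-4, "support": 1e-3,
        "vector": 1e-3, "machine": 2e-3}            # hypothetical estimates
print(segment("a standard support vector machine setup".split(), quality, prob))
```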
  • 47. 47 SegPhrase+: Enhancing Phrasal Segmentation  SegPhrase+: One more round for enhanced phrasal segmentation  Feedback  Using rectified frequencies, re-compute the features previously computed from raw frequencies  Process  Classification  Phrasal segmentation // SegPhrase  Classification  Phrasal segmentation // SegPhrase+  Effect on quality scores, e.g., separating the complete phrase “np hard in the strong sense” from the incomplete “np hard in the strong”, and recognizing “data base management system”
  • 48. 48 Performance Study: Methods to Be Compared  Other phrase mining methods to be compared  NLP chunking based methods  Chunks as candidates  Sorted by TF-IDF and C-value (K. Frantzi et al., 2000)  Unsupervised raw frequency based methods  ConExtr (A. Parameswaran et al., VLDB 2010)  ToPMine (A. El-Kishky et al., VLDB 2015)  Supervised method  KEA, designed for single-document keyphrases (O. Medelyan & I. H. Witten, 2006)
  • 49. 49 Performance Study: Experimental Setting  Datasets (see table below)  Popular Wiki Phrases  Based on internal links  ~7K high-quality phrases  Pooling  Sampled 500 * 7 Wiki-uncovered phrases  Evaluated by 3 reviewers independently
Dataset | #docs | #words | #labels
DBLP | 2.77M | 91.6M | 300
Yelp | 4.75M | 145.1M | 300
  • 50. 50 Performance: Precision-Recall Curves on DBLP  [Figure: two sets of precision-recall curves]  Comparison with other baselines: TF-IDF, C-Value, ConExtr, KEA, ToPMine, SegPhrase+  Comparison among our 3 variations: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+
  • 51. 51 Performance Study: Processing Efficiency  SegPhrase+ is linear in the size of the corpus!
  • 52. 52 Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGMOD)
Rank | SegPhrase+ | Chunking (TF-IDF & C-Value)
1 | data base | data base
2 | database system | database system
3 | relational database | query processing
4 | query optimization | query optimization
5 | query processing | relational database
… | … | …
51 | sql server | database technology
52 | relational data | database server
53 | data structure | large volume
54 | join query | performance study
55 | web service | web service
… | … | …
201 | high dimensional data | efficient implementation
202 | location based service | sensor network
203 | xml schema | large collection
204 | two phase locking | important issue
205 | deep web | frequent itemset
(The original slide highlights phrases found only by SegPhrase+ and only by chunking.)
  • 53. 53 Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGKDD)
Rank | SegPhrase+ | Chunking (TF-IDF & C-Value)
1 | data mining | data mining
2 | data set | association rule
3 | association rule | knowledge discovery
4 | knowledge discovery | frequent itemset
5 | time series | decision tree
… | … | …
51 | association rule mining | search space
52 | rule set | domain knowledge
53 | concept drift | important problem
54 | knowledge acquisition | concurrency control
55 | gene expression data | conceptual graph
… | … | …
201 | web content | optimal solution
202 | frequent subgraph | semantic relationship
203 | intrusion detection | effective way
204 | categorical attribute | space complexity
205 | user preference | small set
(The original slide highlights phrases found only by SegPhrase+ and only by chunking.)
  • 54. 54 Experimental Results: Similarity Search  In response to a user’s phrase query, SegPhrase+ finds high-quality, semantically similar phrases  In DBLP, query on “data mining” and “OLAP”  In Yelp, query on “blu-ray”, “noodle”, and “valet parking”
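The slides do not specify how similarity is computed; one common choice is cosine similarity over phrase vectors learned from the segmented corpus. The sketch below uses random toy vectors purely to show the lookup mechanics, not the system's actual representation.

```python
import numpy as np

def most_similar(query, phrase_vecs, k=3):
    """Return the k phrases whose vectors have highest cosine similarity
    to the query phrase's vector."""
    q = phrase_vecs[query]
    sims = {p: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for p, v in phrase_vecs.items() if p != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]

rng = np.random.default_rng(0)   # toy random embeddings for illustration
phrase_vecs = {p: rng.normal(size=8) for p in
               ["data mining", "knowledge discovery", "olap", "noodle"]}
print(most_similar("data mining", phrase_vecs))
```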
  • 55. 55 Experimental Results: High Quality Phrases Generated (Top-Ranked Phrases From English Gigaword)  Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin  Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna  Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis  Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh  Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler  Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar  Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen  Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova  Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku  Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma,  Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo  Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate
  • 56. 56 Recent Progress  Distant Training: No need for human labeling  Training using general knowledge bases  E.g., Freebase, Wikipedia  Quality Estimation for Unigrams  Integration of phrases and unigrams in one uniform framework  Demo based on DBLP abstracts  Multiple Languages: Beyond English corpora  Extensible to mining quality phrases in multiple languages  Recent progress: SegPhrase+ works on Chinese and Arabic
  • 57. 57 Mining Quality Phrases in Multiple Languages  Both ToPMine and SegPhrase+ are extensible to mining quality phrases in multiple languages
SegPhrase+ on Chinese (from Chinese Wikipedia):
Rank | Phrase | In English
62 | 首席_执行官 | CEO
63 | 中间_偏右 | Center-right
84 | 百度_百科 | Baidu Baike
85 | 热带_气旋 | Tropical cyclone
86 | 中国科学院_院士 | Fellow of the Chinese Academy of Sciences
1001 | 十大_中文_金曲 | Top-10 Chinese Songs
1002 | 全球_资讯网 | Global News Website
1003 | 天一阁_藏_明代_科举_录_选刊 | A Chinese book title
9934 | 国家_戏剧_院 | National Theater
9935 | 谢谢_你 | Thank you
ToPMine on Arabic (from the Quran, Fus7a Arabic, no preprocessing); example phrases:
كفروا — “those who disbelieve”
بسم الله الرحمن الرحيم — “In the name of God, the Gracious, the Merciful”
  • 58. 58 Summary on SegPhrase+ and Future Work  SegPhrase+: A new phrase mining framework  Integrating phrase mining with phrasal segmentation  Requires only limited training or distant training  Generates high-quality phrases, close to human judgment  Linearly scalable in time and space  Looking forward: High-quality, scalable phrase mining  Facilitate entity recognition and typing in large corpora  (See the next part of this tutorial)  Transform massive unstructured data into semi-structured knowledge networks
  • 60. 60 References
 D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
 K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
 M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14
 A. El-Kishky, Y. Song, C. Wang, C. R. Voss, J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
 K. Frantzi, S. Ananiadou, H. Mima. Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method. Int. Journal on Digital Libraries, 3(2), 2000
 R. V. Lindsey, W. P. Headden III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL’12
 J. Liu, J. Shang, C. Wang, X. Ren, J. Han. Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
 O. Medelyan and I. H. Witten. Thesaurus Based Automatic Keyphrase Indexing. IJCDL’06
 Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. KDD’07
 A. Parameswaran, H. Garcia-Molina, A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10
 X. Wang, A. McCallum, X. Wei. Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval. ICDM’07

Editor's Notes

  1. An interesting comparison with a state-of-the-art phrase-discovering topic model (this note is attached to each of the comparison slides).
  2. A picky audience might ask how to project phrase mining results back onto the corpus to obtain mention-level chunking. During the presentation we can emphasize that this is not the goal; we think it is more interesting to discover important phrases from a large corpus without any manual annotation.