(Hierarchical) Topic Modeling
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML
Software Engineering 2016/12/29
Outline
 Background
 Some Concepts
 Topic Modeling
    Probabilistic Latent Semantic Indexing (PLSI)
    Latent Dirichlet Allocation (LDA)
 Hierarchical Topic Modeling
    Chinese Restaurant Process (CRP)
 What I do
 Supplement & Reference
Keywords: topic modeling, hierarchical topic modeling, probabilistic graphical model, Bayesian model
Basics, not state-of-the-art
Background
 Information Overloading: Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.
 We need: summarization, visualization, dimension reduction
Background
 Text Summarization
    Document Summarization
       What do these docs (or this doc) talk about?
    Review Summarization
       What do these consumers care about or complain about?
    Short Text/Tweets Summarization
       What are people discussing?
 Basic Requirement: Automatic, Applicable, Explainable
Topic Modeling
Some Concepts
 General Concepts
    Latent Semantic Analysis
    Text Mining
    Natural Language Processing
    Computational Linguistics
    Information Retrieval
    Dimension Reduction
    Topic Modeling
[Figure: diagram relating Information Retrieval, Computational Linguistics, Natural Language Processing, Text Mining, Data Mining, Machine Learning, Machine Translation, Dimension Reduction, LSA, and Topic Modeling]
Topic Modeling → to learn the latent topics from a corpus/document
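Topic Modeling
 Topic modeling
    an example in Chinese, shown here in English translation (from my doctoral thesis)
Corpus (Doc 1 to Doc 4):
 "Continue to implement a prudent monetary policy, keep it appropriately balanced between tight and loose with timely anticipatory adjustments and fine-tuning, coordinate with the supply-side structure, and make comprehensive use of quantity-based, price-based, and other monetary policy"
 "In terms of personnel numbers, this reform goes far beyond the size of the troop reduction; it is a structural reform and a key step in modernizing the organizational structure of the armed forces"
 "The US dollar's status as the main international currency remains irreplaceable for the foreseeable future; the only way forward is to push global governance in a more balanced direction. In a recent speech at the University of Maryland in the United States, IMF Managing Director Lagarde called for international governance reform to recognize the reality that emerging economies are increasingly important."
 "After being 'weaned' from their parent universities, independent colleges may face growing pains in branding, enrollment, and other areas; but after national, provincial, and municipal guidelines encouraging private capital to enter education were issued, some independent colleges decisively cut the 'umbilical cord' to their parent universities and set out on their own."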
Topic Modeling
 After topic modeling
Corpus: Doc 1, Doc 2, Doc 3, Doc 4 (the four documents of the previous example)
Topics 1 to 4, each a group of words with probabilities:
 policy (政策) 0.082, reform (改革) 0.063, …
 finance (金融) 0.074, currency (货币) 0.051, …
 college (学院) 0.077, education (教育) 0.071, …
 military (军队) 0.083, organization (组织) 0.079, …
Topic Modeling
 A topic
    A word cluster → a group of words
    Not clustered randomly, but meaningfully (not semantically)
 Models
    Parametric models
       Latent Semantic Indexing (LSI)
       PLSI; Latent Dirichlet Allocation (LDA)
    Non-parametric models (Dirichlet Process)
       (Nested) Chinese Restaurant Process
       Indian Buffet Process
       Pitman-Yor Process
Topic Modeling
pLSI Model
[Figure: graphical model of pLSI. Documents d1, d2, …, dM connect to latent topics z1, z2, …, zK, which connect to words w1, w2, …, wN. The layers are labeled p(d), p(z|d), and p(w|z); p(z|d) and p(w|z) are multinomial distributions. The structure resembles one layer of a 'deep neural network'.]
 Assumption
    Pairs (d, w) are assumed to be generated independently
    Conditioned on z, w is generated independently of d
    Words in a document are exchangeable
    Documents are exchangeable
    Latent topics z are independent
The generative process
$$p(d, w) = p(w \mid d)\,p(d) = p(d) \sum_{z \in Z} p(w, z \mid d) = p(d) \sum_{z \in Z} p(w \mid z)\,p(z \mid d)$$
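The pLSI factorization p(d, w) = p(d) Σ_z p(w|z) p(z|d) can be checked numerically. The sketch below is only an illustration: the two documents, two topics, three words, and every probability value are hypothetical and do not come from the slides.

```python
# Hypothetical toy parameters: 2 documents, 2 latent topics, 3 words.
p_d = [0.6, 0.4]                  # p(d)
p_z_given_d = [[0.9, 0.1],        # p(z|d) for d = 0
               [0.2, 0.8]]        # p(z|d) for d = 1
p_w_given_z = [[0.7, 0.2, 0.1],   # p(w|z) for z = 0
               [0.1, 0.3, 0.6]]   # p(w|z) for z = 1


def joint(d, w):
    """p(d, w) = p(d) * sum_z p(w|z) * p(z|d)."""
    n_topics = len(p_w_given_z)
    mixture = sum(p_w_given_z[z][w] * p_z_given_d[d][z] for z in range(n_topics))
    return p_d[d] * mixture


# Full joint table over all (document, word) pairs.
joint_table = [[joint(d, w) for w in range(3)] for d in range(2)]
total = sum(sum(row) for row in joint_table)

for row in joint_table:
    print([round(p, 4) for p in row])
print("sum over all (d, w) pairs:", round(total, 6))
```

Because each row of p(z|d) and p(w|z) sums to one, the joint table also sums to one, which is a quick sanity check on any hand-built pLSI parameter set.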
Topic Modeling
 Latent Dirichlet Allocation (LDA)
    David M. Blei, Andrew Y. Ng, Michael I. Jordan
    Hierarchical Bayesian model; Bayesian pLSI
[Figure: plate diagram of LDA with nodes α, θ, z, w, and β; the plates N and M give the number of iterations]
Generative process of LDA
 For each document $d = \{w_1, w_2, \dots, w_N\}$:
    Choose $N \sim \mathrm{Poisson}(\xi)$
    Choose $\theta \sim \mathrm{Dir}(\alpha)$
    For each of the $N$ words $w_n$ in $d$:
      a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$
      b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial distribution conditioned on $z_n$
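As a minimal sketch of the LDA generative process, one document can be sampled with the Python standard library alone. The vocabulary, the values of α, β, and ξ, and all function names are hypothetical choices for illustration, not values from the slides.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(rng, alpha, beta, vocab, xi):
    """Sample one document: N ~ Poisson(xi), theta ~ Dir(alpha), then z_n and w_n."""
    n_words = sample_poisson(rng, xi)
    theta = sample_dirichlet(rng, alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]            # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
vocab = ["policy", "currency", "college", "education"]
alpha = [0.5, 0.5]
beta = [[0.5, 0.4, 0.05, 0.05],   # topic 0 favors economic words
        [0.05, 0.05, 0.5, 0.4]]   # topic 1 favors education words

rng = random.Random(0)
theta, topics, words = generate_document(rng, alpha, beta, vocab, xi=8)
print("theta:", [round(t, 3) for t in theta])
print("topics:", topics)
print("words:", words)
```

Inference runs this process in reverse: given only the words, it estimates θ, z, and β.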
Topic Modeling
 Parameter Estimation
    Variational Inference (+EM): complex, rarely used
       'I want to know a distribution that I do not know yet, so I find a similar distribution (a tight upper or lower bound)'
       K-L divergence (or information gain)
    Gibbs Sampling (MCMC, Markov Chain Monte Carlo)
       'I want to know a distribution that I do not know yet, so I find a way to generate samples from it'
       About 300 lines of code for LDA; not complex but solid
       Stationary distribution:
$$\lim_{n\to\infty} P^n = \begin{pmatrix} \pi(1) & \cdots & \pi(|S|) \\ \vdots & \ddots & \vdots \\ \pi(1) & \cdots & \pi(|S|) \end{pmatrix}, \qquad \lim_{n\to\infty} \pi_0 P^n = \pi$$
       $\pi = \{\pi(1), \pi(2), \dots, \pi(j), \dots, \pi(|S|)\}$
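MCMC methods such as Gibbs sampling rely on a Markov chain forgetting its starting point: π0 Pⁿ approaches the same stationary distribution π for any initial π0. The sketch below illustrates this with a hypothetical two-state transition matrix P chosen only for the example; it is not a Gibbs sampler for LDA.

```python
def step(dist, P):
    """One step of the chain: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def run_chain(pi0, P, n_steps):
    """Apply the transition matrix n_steps times to the initial distribution pi0."""
    dist = list(pi0)
    for _ in range(n_steps):
        dist = step(dist, P)
    return dist


# Hypothetical transition matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

from_state_0 = run_chain([1.0, 0.0], P, 200)
from_state_1 = run_chain([0.0, 1.0], P, 200)

print("start in state 0:", [round(p, 6) for p in from_state_0])
print("start in state 1:", [round(p, 6) for p in from_state_1])
```

Both runs converge to π = (5/6, 1/6), which satisfies π = πP for this matrix. A sampler is designed so that this stationary distribution is the posterior of interest.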
Hierarchical Topic Modeling
Topic modeling is not enough
Hierarchical Structure
Hierarchical Topic Modeling
Chinese Restaurant Process (Dirichlet Process)
 A restaurant has an infinite number of tables, and customers (words) enter it sequentially. The i-th customer ($\theta_i$) sits at an occupied table ($\phi_k$) with probability proportional to the number of customers already seated there, or opens a new table.
 $\phi_k$: Clustering == 1/2 unsupervised learning → clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation…
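In the standard CRP, the i-th customer joins occupied table k with probability n_k / (i − 1 + α) and opens a new table with probability α / (i − 1 + α). Here n_k is the number of customers already at table k and α is a concentration parameter; both symbols are assumed notation, not taken from the slide. A minimal simulation under these assumptions:

```python
import random


def crp_seating(n_customers, alpha, rng):
    """Seat customers one by one; return (table index per customer, table sizes)."""
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        # Occupied table k has weight n_k; a new table has weight alpha.
        # rng.choices normalizes the weights, i.e. divides by (i - 1 + alpha).
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[k] += 1     # join an occupied table
        assignments.append(k)
    return assignments, table_sizes


rng = random.Random(1)
assignments, table_sizes = crp_seating(100, alpha=1.0, rng=rng)
print("number of tables:", len(table_sizes))
print("table sizes:", table_sizes)
```

The number of tables is not fixed in advance and grows slowly with the number of customers, which is why the CRP serves as a prior for clustering with an unknown number of clusters.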
Hierarchical Topic Modeling
 The generative process (nested CRP)
 Focus on the insight
1. Let $c_1$ be the root restaurant (only one table)
2. For each level $l \in \{2, \dots, L\}$:
   Draw a table from restaurant $c_{l-1}$ using the CRP. Set $c_l$ to be the restaurant referred to by that table
3. Draw an $L$-dimensional topic proportion vector $\theta \sim \mathrm{Dir}(\alpha)$
4. For each word $w_n$:
   Draw $z \in \{1, \dots, L\} \sim \mathrm{Mult}(\theta)$
   Draw $w_n$ from the topic associated with restaurant $c_z$
[Figure: plate diagram of the nested CRP model with nodes α, γ, β, $c_1$, $c_2$, …, $c_L$, $z_{m,n}$, $w_{m,n}$ and plates N, M, T; the nesting resembles a Matryoshka (Russian) doll]
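The four steps of the nested CRP generative process can be sketched as follows. In this sketch documents play the role of customers when a path is drawn. The class name `Restaurant`, the symmetric Dirichlet parameter `eta` for each node's topic, the fixed document length, and the vocabulary are all hypothetical choices made for illustration.

```python
import random


def sample_dirichlet(rng, params):
    """Draw from Dir(params) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]


class Restaurant:
    """One tree node; it owns one topic (a distribution over the vocabulary)."""

    def __init__(self, rng, vocab_size, eta):
        self.n_customers = 0   # documents whose path passes through this node
        self.children = []     # restaurants referred to by this node's tables
        self.topic = sample_dirichlet(rng, [eta] * vocab_size)


def draw_path(root, depth, gamma, eta, vocab_size, rng):
    """Steps 1-2: start at the root and pick a child by the CRP at each level."""
    path = [root]
    root.n_customers += 1
    node = root
    for _ in range(depth - 1):
        weights = [child.n_customers for child in node.children] + [gamma]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(node.children):   # new table, hence a new child restaurant
            node.children.append(Restaurant(rng, vocab_size, eta))
        node = node.children[k]
        node.n_customers += 1
        path.append(node)
    return path


def generate_document(root, depth, n_words, alpha, gamma, eta, vocab, rng):
    path = draw_path(root, depth, gamma, eta, len(vocab), rng)
    theta = sample_dirichlet(rng, [alpha] * depth)       # step 3
    words = []
    for _ in range(n_words):                             # step 4
        z = rng.choices(range(depth), weights=theta)[0]  # level z ~ Mult(theta)
        words.append(rng.choices(vocab, weights=path[z].topic)[0])
    return path, words


rng = random.Random(0)
vocab = ["acid", "polymer", "yield", "assay", "metal"]   # hypothetical vocabulary
root = Restaurant(rng, len(vocab), eta=0.5)
docs = [generate_document(root, depth=3, n_words=8, alpha=1.0, gamma=1.0,
                          eta=0.5, vocab=vocab, rng=rng) for _ in range(5)]
for path, words in docs:
    print(words)
```

Every document shares the root topic, while documents that take the same branch share more specific topics, which is what produces a topic hierarchy.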
Hierarchical Topic Modeling
Examples
[Figure: a learned topic hierarchy; each node shows its top five words]
Root topic: analysis, obtain, base, system, concentration
Other topics:
 thermal, polymer, acid, property, diamine
 activity, compound, acid, derivative, active
 compound, ligand, group, investigate, synergistic
 reaction, derivative, yield, synthesis, microwave
 assay, food, quality, content, analysis
 decoction, component, radix, quality, constituent
 compound, activity, synthesize, salt, derivative
 antioxidant, activity, extract, inhibitory, flavonoid
 interaction, cation, metal, energy, solution
What I do
Topic-specific opinion mining
 Goal: automatically learn which groups of aspects people like or dislike, and how and why they like them
 Methods: topic model (LDA), Dirichlet process, Gibbs sampling, etc.
Collaborative recommendation
 Goal: automatically learn which groups of products people like or dislike, and how and why they like them
 Methods: matrix factorization, gradient descent, regularization norm, etc.
 Common basics: Bayesian inference (MLE, MAP, PGM)
Supplement
Some supplements
 Probabilistic Graphical Model
    Modeling Bayesian Network using plates and circles
 Generative Model & Discriminative Model: $p(\theta \mid X)$, where $X$ is the data
    Generative Model: $p(\theta \mid X) \propto p(X \mid \theta)\,p(\theta)$
      - Naïve Bayes, GMM, pLSA, LDA, HMM, HDP…: mostly Unsupervised Learning
    Discriminative Model: $p(\theta \mid X)$
      - LR, KNN, SVM, Boosting, Decision Tree: Supervised Learning
Also can be represented by graphical models
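The proportionality p(θ|X) ∝ p(X|θ)p(θ) can be made concrete with a multinomial likelihood and a Dirichlet prior, the conjugate pair that LDA is built on. The sketch below compares the MLE with the MAP estimate; the word counts and the prior parameters are hypothetical.

```python
def mle(counts):
    """Maximum likelihood estimate of multinomial probabilities: c_i / sum(c)."""
    total = sum(counts)
    return [c / total for c in counts]


def map_estimate(counts, alpha):
    """Mode of the Dirichlet(alpha + counts) posterior.

    Valid when alpha_i + c_i > 1 for every i:
    theta_i = (c_i + alpha_i - 1) / (sum(c) + sum(alpha) - K).
    """
    k = len(counts)
    denom = sum(counts) + sum(alpha) - k
    return [(c + a - 1) / denom for c, a in zip(counts, alpha)]


counts = [3, 0, 1]        # hypothetical word counts for a 3-word vocabulary
alpha = [2.0, 2.0, 2.0]   # hypothetical symmetric Dirichlet prior

theta_mle = mle(counts)
theta_map = map_estimate(counts, alpha)
print("MLE:", [round(t, 4) for t in theta_mle])
print("MAP:", [round(t, 4) for t in theta_map])
```

The MLE assigns probability zero to the unseen word, while the prior pulls the MAP estimate away from zero, which is the usual motivation for Bayesian smoothing in topic models.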
Reference
 My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU, as a Ph.D. student)
    'Topic modeling (an introduction)'
    'Non-parametric Bayesian learning in discrete data'
    'The research of topic modeling in text mining'
    'Matrix factorization with user generated content'
    …, etc.
 Website
    You can download all of my slides
    http://web.xidian.edu.cn/ysxu/teach.html
    http://liu.cs.uic.edu/yueshenxu/
    http://www.slideshare.net/obamaxys2011
    https://www.researchgate.net/profile/Yueshen_Xu
Reference
• David Blei et al. Latent Dirichlet Allocation. JMLR, 2003
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David Blei et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014
Q&A
