This document provides an overview of hierarchical topic modeling. It begins with background on text summarization and topic modeling. Some key concepts in topic modeling like latent semantic analysis and probabilistic latent semantic indexing (PLSI) are introduced. Popular topic models like latent Dirichlet allocation (LDA) and hierarchical topic models using the Chinese restaurant process are described. Gibbs sampling is discussed as a method for parameter estimation in topic models. The document concludes with examples of hierarchical topic modeling and information on the author's related work.
This is an introduction to topic modeling, covering tf-idf, LSA, pLSA, LDA, EM, and some related material. There are surely some mistakes; please correct them with your wisdom. Thank you~
1. (Hierarchical) Topic Modeling
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML
2. Software Engineering, 2016/12/29
Outline
Background
Some Concepts
Topic Modeling
Probabilistic Latent Semantic Indexing (PLSI)
Latent Dirichlet Allocation (LDA)
Hierarchical Topic Modeling
Chinese Restaurant Process (CRP)
What I do
Supplement & Reference
Keywords: topic modeling, hierarchical topic modeling, probabilistic graphical model, Bayesian model
Basics, not state-of-the-art
4. Background
Text Summarization
Document Summarization
What do these docs (or this doc) talk about?
Review Summarization
What do these consumers care about or complain about?
Short Text/Tweets Summarization
What are people discussing?
Basic requirement: automatic, applicable, explainable → Topic Modeling
5. Some Concepts
Latent Semantic Analysis (LSA) and topic modeling draw on many neighboring fields: Text Mining, Data Mining, Natural Language Processing, Computational Linguistics, Machine Translation, Information Retrieval, Dimension Reduction, and Machine Learning.
[Venn diagram: these fields overlap, with LSA / Topic Model at the intersection]
Topic modeling: to learn the latent topics from a corpus/document
6. Topic Modeling
An example in Chinese, from my doctoral thesis (translated below):
Doc1: Continue to implement a prudent monetary policy; keep it appropriately flexible with timely pre-adjustments and fine-tuning; coordinate with supply-side structural reform; and make comprehensive use of quantity-based, price-based, and other monetary policy tools.

Doc2: In terms of personnel, this reform goes far beyond troop reduction in scale; it is a structural reform and a key step in modernizing the military's organizational structure.

Doc3: The dollar's status as the leading international currency remains irreplaceable for the foreseeable future; the only way forward is to push global governance in a more balanced direction. IMF Managing Director Lagarde, in a recent speech at the University of Maryland, called on international governance reform to recognize the growing importance of emerging economies.

Doc4: After being "weaned" from their parent universities, independent colleges may face growing pains in branding and enrollment, but after national and provincial guidelines encouraging private capital to enter education were issued, some independent colleges decisively cut the "umbilical cord" to their parent universities and struck out on their own.
Corpus = {Doc1, Doc2, Doc3, Doc4}
8. Topic Modeling
A topic is a word cluster: a group of words, clustered not randomly but meaningfully (not necessarily semantically)
Models
Parametric models
Latent Semantic Indexing (LSI)
PLSI; Latent Dirichlet Allocation (LDA)
Non-parametric models (Dirichlet Process)
(Nested) Chinese Restaurant Process
Indian Buffet Process
Pitman-Yor Process
9. Topic Modeling
pLSI Model
[Graphical model: documents d1 … dM connect to latent topics z1 … zK, which connect to words w1 … wN, via the probabilities p(d), p(z|d), p(w|z)]
Assumptions:
Pairs (d, w) are assumed to be generated independently
Conditioned on z, w is generated independently of d
Words in a document are exchangeable
Documents are exchangeable
Latent topics z are independent
The generative process:
p(d, w) = p(d) p(w|d) = p(d) Σ_{z∈Z} p(w, z|d) = p(d) Σ_{z∈Z} p(w|z) p(z|d)
Here p(z|d) and p(w|z) are both multinomial distributions; the structure resembles one layer of a 'deep neural network'.
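The factorization above can be sketched numerically. The matrices below are made-up toy values (2 documents, 2 topics, 3 words), not learned parameters:

```python
import numpy as np

# Toy pLSI factorization: p(d, w) = p(d) * sum_z p(w|z) p(z|d)
p_d = np.array([0.5, 0.5])              # p(d) over 2 documents
p_z_given_d = np.array([[0.8, 0.2],     # p(z|d): one row per document,
                        [0.3, 0.7]])    # one column per latent topic
p_w_given_z = np.array([[0.6, 0.1],     # p(w|z): one row per word,
                        [0.3, 0.2],     # one column per topic
                        [0.1, 0.7]])

# Joint p(d, w): a documents-by-words matrix that sums to 1
p_dw = p_d[:, None] * (p_z_given_d @ p_w_given_z.T)
```

Training pLSI amounts to fitting `p_z_given_d` and `p_w_given_z` (typically with EM) so that `p_dw` matches the observed document-word counts.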
10. Topic Modeling
Latent Dirichlet Allocation (LDA)
David M. Blei, Andrew Y. Ng, Michael I. Jordan
Hierarchical Bayesian model; Bayesian pLSI
[Plate diagram: α → θ → z → w ← β; the inner plate repeats over the N words of a document, the outer plate over the M documents]
Generative process of LDA:
Choose N ~ Poisson(ξ)
For each document d = {w1, w2, …, wN}:
  Choose θ ~ Dir(α)
  For each of the N words wn in d:
  a) Choose a topic zn ~ Multinomial(θ)
  b) Choose a word wn from p(wn | zn, β), a multinomial distribution conditioned on zn
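The generative process above can be sketched directly. The vocabulary size, topic count, and hyperparameter values below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 6, 2                    # vocabulary size and topic count (assumed)
alpha, eta, xi = 0.5, 0.5, 10  # hyperparameter values (assumed)
beta = rng.dirichlet([eta] * V, size=K)   # per-topic word distributions

def generate_document():
    n = max(1, rng.poisson(xi))           # N ~ Poisson(xi)
    theta = rng.dirichlet([alpha] * K)    # theta ~ Dir(alpha)
    doc = []
    for _ in range(n):
        z = rng.choice(K, p=theta)        # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])      # w_n ~ p(w_n | z_n, beta)
        doc.append(int(w))
    return doc

doc = generate_document()
```

Inference inverts this process: given only the words, it recovers θ for each document and β for each topic.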
11. Topic Modeling
Parameter Estimation
Variational Inference (+EM): complex, rarely used
'I want to know a distribution, but I don't know it yet, so I find a similar distribution (a tight upper or lower bound)'
Measured by K-L divergence (information gain)
Gibbs Sampling (MCMC, Markov Chain Monte Carlo)
'I want to know a distribution, but I don't know it yet, so I find a way to generate its samples'
About 300 lines of code for LDA; not complex, but solid
The chain converges to the stationary distribution: lim_{n→∞} π0 P^n = π for any initial distribution π0, where π = {π(1), π(2), …, π(j), …, π(|S|)} satisfies π = πP (equivalently, every row of lim_{n→∞} P^n equals π)
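The limit above can be checked numerically for a small ergodic chain. The transition matrix P here is an arbitrary illustrative example:

```python
import numpy as np

# An arbitrary ergodic 2-state transition matrix (illustrative example)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([1.0, 0.0])   # an arbitrary starting distribution pi_0

for _ in range(200):        # compute pi_0 P^n for large n
    pi = pi @ P

# The limit is the stationary distribution: pi = pi P
assert np.allclose(pi, pi @ P)
assert np.allclose(pi, [0.8, 0.2])  # analytic solution of pi = pi P with sum(pi) = 1
```

Gibbs sampling for LDA exploits exactly this: the sampler is a Markov chain whose stationary distribution is the posterior over topic assignments.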
13. Hierarchical Topic Modeling
Chinese Restaurant Process (Dirichlet Process)
A restaurant has an infinite number of tables, and customers (words) enter sequentially. The i-th customer (θi) sits at an occupied table ϕk with probability nk / (i − 1 + γ), where nk is the number of customers already at that table, or opens a new table with probability γ / (i − 1 + γ).
Each table ϕk is a cluster. Clustering is roughly half of unsupervised learning: clustering itself, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
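The seating rule can be simulated in a few lines. The function name and the choice of 100 customers with γ = 1.0 are illustrative:

```python
import random

def crp_seating(n_customers, gamma, seed=0):
    """Simulate CRP seating: customer i joins occupied table k with
    probability n_k / (i - 1 + gamma), or opens a new table with
    probability gamma / (i - 1 + gamma)."""
    rng = random.Random(seed)
    tables = []                              # tables[k] = customers at table k
    for i in range(1, n_customers + 1):
        r = rng.uniform(0, i - 1 + gamma)
        acc = 0.0
        for k, n_k in enumerate(tables):
            acc += n_k
            if r < acc:
                tables[k] += 1               # join an existing table
                break
        else:
            tables.append(1)                 # open a new table
    return tables

tables = crp_seating(100, gamma=1.0)
```

The rich-get-richer effect is visible in the result: a few large tables and a long tail of small ones, with the number of tables growing only logarithmically in the number of customers.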
14. Hierarchical Topic Modeling
The generative process (nested CRP); focus on the insight:
1. Let c1 be the root restaurant (only one table)
2. For each level l ∈ {2, …, L}: draw a table from restaurant c(l−1) using the CRP, and set cl to be the restaurant referred to by that table
3. Draw an L-dimensional topic proportion vector θ ~ Dir(α)
4. For each word wn: draw z ∈ {1, …, L} ~ Mult(θ), then draw wn from the topic associated with restaurant cz
[Plate diagram (hLDA): a path of restaurants c1, c2, …, cL drawn with concentration γ; topic proportions θ from α; assignment zm,n and word wm,n for each of the N words in each of the M documents; topics βk. The nesting is like a Matryoshka (Russian) doll]
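Steps 1–2 above (drawing a path through nested CRPs) can be sketched as follows; the dict-of-counts tree representation and function name are illustrative assumptions:

```python
import random

def ncrp_path(tree, gamma, L, rng):
    """Draw one root-to-leaf path of depth L through nested CRPs.
    tree maps a path tuple to the per-child customer counts of that
    restaurant; it is updated in place as customers arrive."""
    path = ()                                    # level 1: the root restaurant
    for _ in range(2, L + 1):                    # levels 2..L
        counts = tree.setdefault(path, [])
        r = rng.uniform(0, sum(counts) + gamma)
        acc = 0.0
        for k, n_k in enumerate(counts):
            acc += n_k
            if r < acc:
                counts[k] += 1                   # follow an existing table
                path += (k,)
                break
        else:
            counts.append(1)                     # open a new child restaurant
            path += (len(counts) - 1,)
    return path

rng = random.Random(0)
tree = {}
paths = [ncrp_path(tree, gamma=1.0, L=3, rng=rng) for _ in range(50)]
```

Documents that share path prefixes share the more general topics near the root, which is what produces the learned topic hierarchy.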
15. Hierarchical Topic Modeling
Examples
[Example topic hierarchy (word clusters learned from a chemistry corpus):
root: analysis, obtain, base, system, concentration
thermal, polymer, acid, property, diamine
activity, compound, acid, derivative, active
compound, ligand, group, investigate, synergistic
reaction, derivative, yield, synthesis, microwave
assay, food, quality, content, analysis
decoction, component, radix, quality, constituent
compound, activity, synthesize, salt, derivative
antioxidant, activity, extract, inhibitory, flavonoid
interaction, cation, metal, energy, solution]
16. What I do
Topic-specific opinion mining
Goal: automatically learn which groups of aspects people like or dislike, how they like them, and why
Methods: topic models (LDA), Dirichlet processes, Gibbs sampling, etc.
Collaborative recommendation
Goal: automatically learn which groups of products people like or dislike, how they like them, and why
Methods: matrix factorization, gradient descent, regularization norms, etc.
Common basics: Bayesian inference (MLE, MAP, PGM)
17. Supplement
Some supplements
Probabilistic Graphical Models: Bayesian networks drawn with plates and circles
Generative model vs. discriminative model, for the posterior p(θ|X) with data X:
Generative model: p(θ|X) ∝ p(X|θ)p(θ)
- Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, …: typically unsupervised learning
Discriminative model: p(θ|X) modeled directly
- LR, KNN, SVM, Boosting, Decision Tree: supervised learning
These can also be represented by graphical models
18. Reference
My previous tutorials/notes (written at ZJU/UIC/Netease/ITRZJU as a Ph.D. student)
‘Topic modeling (an introduction)’
‘Non-parametric Bayesian learning in discrete data’
‘The research of topic modeling in text mining’
‘Matrix factorization with user generated content’
…, etc
Website
You can download all slides of mine
http://web.xidian.edu.cn/ysxu/teach.html
http://liu.cs.uic.edu/yueshenxu/
http://www.slideshare.net/obamaxys2011
https://www.researchgate.net/profile/Yueshen_Xu
19. Reference
• David M. Blei, et al. Latent Dirichlet Allocation. JMLR, 2003
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David M. Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David M. Blei, et al. The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014