An online semantic enhanced dirichlet model for short text

University of Electronic Science and Technology of China
Data Mining Lab
An Online Semantic-enhanced Dirichlet
Model for Short Text Stream Clustering
Jay Kumar, Junming Shao, Salah Ud Din and Wazir Ali
Dated: July 6, 2020
The 58th Annual Meeting of the Association for Computational Linguistics

Outline
• Motivation
• Existing Problems
• Proposed Model
• Experimentation
• Conclusion

Motivation
• Short-text data is generated by many online sources
• Clustering / topic modeling supports news recommendation, hot-topic detection, advertising, and opinion mining

Previous Algorithms
• Similarity Based
– HPStream (Aggarwal et al., 2003)
– FW-Kmeans (Jing et al., 2005)
– Den-Stream (Cao et al., 2006)
– SPKM (Zhong et al., 2005)
– ConStream (Aggarwal et al., 2010)
– Limitations: require a pre-defined similarity threshold; sparsity and scalability issues
• Model Based
– LDA (Blei et al., 2003)
– DTM (Blei et al., 2006)
– TDPM (Ahmed et al., 2008)
– TM-LDA (Wang et al., 2012)
– DPMFP (Huang et al., 2013)
– NPMM (Chen et al., 2019)
– MStream (Yin et al., 2018)
– Limitations: batch-wise processing; lack of semantic embedding

Challenges
• Semantic Information
– Term ambiguity
• the context of a word changes with its accompanying words
– short documents contain fewer supporting terms
• Concept Drift
– Change in topics over time
– Life-span of topics differs
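• Batch vs. Online Processing
– Batch Processing
• Assumption: no concept drift inside a batch
[Figure: term-ambiguity example, "Lady Gaga perform show" vs. "The wicked lady show"]
[Figure: a stream $stream = \{d_t\}_{t=1}^{\infty}$ divided into batches $B_n$, $B_{n+1}$, with temporal dependencies between them]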

Proposed Model
• Propose an online semantic-enhanced Dirichlet model (OSDM)
– Non-parametric probabilistic graphical model
– Automatic detection of topics
– Online Clustering
• Maintain active topics online
– Semantic Information
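[Figure: plate diagram of the OSDM graphical model and its online variant, with variables $w_i$, $w_j$, $w$, $z$, $\theta$, hyperparameters $\alpha$, $\beta$, and plates $\mathcal{D}$, $\mathcal{N}$, $\infty$]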
Basic Idea:
• A co-occurrence matrix is embedded to remove term ambiguity
• The inverse cluster frequency (ICF) of single terms is used for semantic smoothing

OSDM PGM
• Probabilistic Model Formulation
– Prior probability: $p(z_d = z \mid \vec{z}, \alpha)$
– Existing cluster: $\frac{m_z}{D - 1 + \alpha D}$
– New cluster: $\frac{\alpha D}{D - 1 + \alpha D}$
– $m_z$: number of documents in cluster $z$
– $D$: number of active documents
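The two prior terms above can be checked numerically. The following is a minimal sketch, assuming that $D$ counts the arriving document as well, so that the existing cluster sizes sum to $D - 1$; the function names and the example values are hypothetical and not taken from the slides.

```python
def cluster_prior(m_z: int, num_docs: int, alpha: float) -> float:
    """Prior of joining an existing cluster that holds m_z documents."""
    return m_z / (num_docs - 1 + alpha * num_docs)


def new_cluster_prior(num_docs: int, alpha: float) -> float:
    """Prior of opening a new cluster."""
    return (alpha * num_docs) / (num_docs - 1 + alpha * num_docs)


# Hypothetical state: 9 earlier active documents in clusters of size 5, 3, 1,
# plus the arriving document, so D = 10 (an assumption about how D is counted).
sizes = [5, 3, 1]
D = 10
alpha = 0.1

priors = [cluster_prior(m, D, alpha) for m in sizes]
p_new = new_cluster_prior(D, alpha)
total = sum(priors) + p_new  # the priors form a distribution under this assumption
```

Under this reading, larger clusters attract new documents more strongly, while $\alpha$ controls how readily a new cluster is opened.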

OSDM PGM
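– Probability measuring the similarity between a cluster and a document: $p(d \mid z_d = z, d_{z \to d}, \beta)$
– Multinomial distribution
– Existing cluster:
$p(d \mid z_d = z, d_{z \to d}, \beta) = \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)} \cdot \ldots$
– New cluster:
$\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (\beta + j - 1)}{\prod_{i=1}^{N_d} (V\beta + i - 1)}$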
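The products in these likelihoods underflow quickly, so a sketch in log space may be easier to follow. The code below is an illustration only: it covers the ICF-weighted multinomial part and leaves out any further factor, the function names are hypothetical, and the default ICF of 1.0 for terms without an ICF value is an assumption.

```python
import math
from collections import Counter


def log_likelihood_existing(doc, cluster_counts, icf, beta, vocab_size):
    """log of the ICF-weighted multinomial term for an existing cluster."""
    n_z = sum(cluster_counts.values())        # total term count in the cluster
    log_p = 0.0
    for w, n_dw in Counter(doc).items():      # n_dw: occurrences of w in the document
        # Assumption: terms with no ICF entry get weight 1.0.
        weighted = cluster_counts.get(w, 0) * icf.get(w, 1.0)
        for j in range(1, n_dw + 1):
            log_p += math.log(weighted + beta + j - 1)
    for i in range(1, len(doc) + 1):          # one denominator factor per term
        log_p -= math.log(n_z + vocab_size * beta + i - 1)
    return log_p


def log_likelihood_new(doc, beta, vocab_size):
    """log of the new-cluster term, which depends only on beta and V."""
    log_p = 0.0
    for n_dw in Counter(doc).values():
        for j in range(1, n_dw + 1):
            log_p += math.log(beta + j - 1)
    for i in range(1, len(doc) + 1):
        log_p -= math.log(vocab_size * beta + i - 1)
    return log_p


# Hypothetical toy data: the document shares all terms with cluster_a only.
doc = ["lady", "gaga", "show"]
cluster_a = {"lady": 3, "gaga": 2, "show": 4}
cluster_b = {"wicked": 2, "lady": 1}
log_a = log_likelihood_existing(doc, cluster_a, {}, beta=0.01, vocab_size=10)
log_b = log_likelihood_existing(doc, cluster_b, {}, beta=0.01, vocab_size=10)
log_empty = log_likelihood_existing(doc, {}, {}, beta=0.01, vocab_size=10)
log_new = log_likelihood_new(doc, beta=0.01, vocab_size=10)
```

With an empty cluster the existing-cluster term reduces to the new-cluster term, which is a quick consistency check on the two formulas.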

OSDM PGM
– Inverse cluster frequency: $ICF(w \in d) = \log\left(\frac{|Z|}{|w \in Z|}\right)$
– Co-occurrence weight: $cw_{ij} = \frac{\sum_{d' \subseteq z} n_{d'}^{w_i}}{\sum_{d' \subseteq z} n_{d'}^{w_i} + \sum_{d' \subseteq z} n_{d'}^{w_j}}$ s.t. $(w_i, w_j) \in d'$
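Both quantities can be computed from raw term counts. The sketch below assumes that $|Z|$ is the number of active clusters, that $|w \in Z|$ is the number of clusters containing $w$, and that the sums in $cw_{ij}$ run over the documents of one cluster that contain both terms; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def inverse_cluster_frequency(clusters):
    """ICF(w) = log(number of clusters / number of clusters containing w)."""
    num_clusters = len(clusters)
    containing = Counter()
    for docs in clusters:
        vocab = set()
        for doc in docs:
            vocab.update(doc)
        containing.update(vocab)              # count each cluster once per term
    return {w: math.log(num_clusters / c) for w, c in containing.items()}


def cooccurrence_weights(docs):
    """cw[(wi, wj)] for one cluster, using only documents that contain both terms."""
    freq_sums = Counter()                     # (wi, wj) -> summed frequency of wi
    for doc in docs:
        counts = Counter(doc)
        for wi, wj in permutations(counts, 2):
            freq_sums[(wi, wj)] += counts[wi]
    return {
        (wi, wj): total / (total + freq_sums[(wj, wi)])
        for (wi, wj), total in freq_sums.items()
    }


# Hypothetical toy clusters, each a list of tokenized documents.
clusters = [
    [["lady", "gaga", "show"], ["lady", "gaga", "gaga"]],
    [["wicked", "lady", "show"]],
]
icf = inverse_cluster_frequency(clusters)
cw = cooccurrence_weights(clusters[0])
```

In this toy data "lady" appears in every cluster and gets an ICF of 0, while "gaga" is specific to one cluster and gets $\log 2$; the weights of a term pair in both directions sum to 1.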

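OSDM
• Probability of choosing an existing cluster:
$p_e(z \to d) = \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)} \cdot \frac{m_z}{D - 1 + \alpha D} \cdot \ldots$
• Probability of creating a new cluster:
$p_n(z_{new} \to d) = \left(\frac{\alpha D}{D - 1 + \alpha D}\right) \cdot \left(\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (\beta + j - 1)}{\prod_{i=1}^{N_d} (V\beta + i - 1)}\right)$
[Figure: an arriving document $d_t$ is scored against the active topics $T_1$, $T_2$ via $p_e(T_1 \to d_t)$ and $p_e(T_2 \to d_t)$, and against a new topic via $p_n(T_{new} \to d_t)$]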

Remove Outdated Clusters
• Exponential decay
– each cluster has an initial decay score of 1
– $l_z = l_z \times 2^{-\lambda(\Delta time)}$
[Figure: the number of active topics changes over time: 3 topics at time $n$, 4 at time $n'$, 2 at time $n''$]
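The decay rule can be illustrated with a few lines of code. This is a sketch under assumptions: the elapsed time is measured in arbitrary time units, and the pruning threshold is a hypothetical stand-in for "decay score close to zero"; neither value comes from the slides.

```python
def decay_weight(weight: float, lam: float, elapsed: float) -> float:
    """Apply l_z = l_z * 2^(-lambda * elapsed) to one cluster weight."""
    return weight * 2 ** (-lam * elapsed)


def prune(weights: dict, threshold: float) -> dict:
    """Keep only clusters whose decay score is still above the threshold."""
    return {cid: w for cid, w in weights.items() if w > threshold}


# Hypothetical cluster weights before two time units pass.
weights = {"T1": 1.0, "T2": 0.0001}
lam = 0.5
threshold = 1e-3   # assumed cut-off for "approximately zero"

decayed = {cid: decay_weight(w, lam, elapsed=2) for cid, w in weights.items()}
active = prune(decayed, threshold)
```

With $\lambda = 0.5$ a cluster loses half of its weight every two time units, so a larger $\lambda$ forgets inactive topics faster.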

Algorithm: OSDM
Input: $S_t = \{d_t\}_{t=1}^{\infty}$, $\alpha$: concentration parameter, $\beta$: pseudo weight of a term in a cluster, $\lambda$: decay factor
Output: Cluster assignments $z_d$
$K = \emptyset$
while $d_t$ in $S_t$ do
    $t = t + 1$
    $K = removeOldZ_i(K)$ // delete clusters whose $l_z \approx 0$
    $K = reduceClusterWeight(\lambda, K)$
    foreach $z_i \in K$ do
        $P_{z_i} = prob(z_i, d_t)$ using $p_e(z_i \to d_t)$
    end
    $i = \arg\max_i (P_{z_i})$
    $P_{z_n} = p_n(z_{new} \to d_t)$
    if $P_{z_i} < P_{z_n}$ then
        $m_{z_n} = 1$, $l_{z_n} = 1$, $u_{z_n} = t$, $n_{z_n}^{w} = N_{d_t}^{w}$, $cw_{z_n} = cw_{d_t}$, $len_{z_n} = len_{d_t}$
        $K = K \cup z_n$
    else
        $m_{z_i} = m_{z_i} + 1$, $l_{z_i} = 1$, $u_{z_i} = t$, $n_{z_i}^{w} = n_{z_i}^{w} + N_{d_t}^{w}$, $cw_{z_i} = cw_{z_i} \cup cw_{d_t}$, $len_{z_i} = len_{z_i} + len_{d_t}$
    end
end
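The control flow of the algorithm can be sketched in Python. This is a simplified illustration, not the authors' implementation: it scores clusters with the prior and the plain multinomial term only, leaving out the ICF weighting and the co-occurrence information, it applies one decay step per arriving document instead of tracking the last update time, and `min_weight` is a hypothetical pruning threshold.

```python
import math
from collections import Counter


def log_score(counts, length, m_z, n_zw, n_z, D, alpha, beta, V):
    """log(prior * likelihood); m_z = 0 with empty counts gives the new-cluster score."""
    prior = m_z if m_z > 0 else alpha * D
    score = math.log(prior / (D - 1 + alpha * D))
    for w, n_dw in counts.items():
        for j in range(1, n_dw + 1):
            score += math.log(n_zw.get(w, 0) + beta + j - 1)
    for i in range(1, length + 1):
        score -= math.log(n_z + V * beta + i - 1)
    return score


def osdm_sketch(stream, alpha, beta, lam, vocab_size, min_weight=1e-3):
    clusters = {}      # id -> {"m": docs, "l": decay score, "n_w": term counts, "n": total terms}
    assignments = []
    next_id = 0
    for doc in stream:
        # Drop faded clusters, then decay the remaining ones by one time step.
        clusters = {k: c for k, c in clusters.items() if c["l"] > min_weight}
        for c in clusters.values():
            c["l"] *= 2 ** (-lam)

        counts, length = Counter(doc), len(doc)
        D = sum(c["m"] for c in clusters.values()) + 1   # active docs incl. the new one
        best_id, best = None, -math.inf
        for k, c in clusters.items():
            s = log_score(counts, length, c["m"], c["n_w"], c["n"],
                          D, alpha, beta, vocab_size)
            if s > best:
                best_id, best = k, s
        new = log_score(counts, length, 0, {}, 0, D, alpha, beta, vocab_size)

        if best_id is None or best < new:                # open a new cluster
            best_id = next_id
            next_id += 1
            clusters[best_id] = {"m": 0, "l": 1.0, "n_w": Counter(), "n": 0}
        c = clusters[best_id]                            # update the chosen cluster
        c["m"] += 1
        c["l"] = 1.0
        c["n_w"].update(counts)
        c["n"] += length
        assignments.append(best_id)
    return assignments


# Hypothetical toy stream with two topics.
stream = [
    ["lady", "gaga", "show"],
    ["lady", "gaga", "concert"],
    ["stock", "market", "crash"],
    ["market", "stock", "rally"],
    ["gaga", "show", "lady"],
]
labels = osdm_sketch(stream, alpha=0.1, beta=0.01, lam=0.01, vocab_size=8)
```

Each document is seen once and assigned immediately, which is the single-pass behaviour the algorithm describes; the number of clusters grows or shrinks as documents arrive and old clusters fade.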

Complexity
• Space Complexity
– $O(K(V + V^2) + VD)$
• $K$: average number of active topics/clusters
• $D$: number of active documents
• $V$: size of the active vocabulary
• Each cluster stores $V$ vocabulary terms on average
• The co-occurrence matrix of a cluster has at most $V \cdot V$ entries
• Time Complexity
– $O(K(\mathcal{L}V))$
• $\mathcal{L}$: average length of an arriving document

Experimentation
• Datasets
Dataset     Docs    Vocab   Topics
News        11109   8110    152
News-T      11109   8110    152
Tweets      30322   12301   269
Reuters     9447    32303   66
Reuters-T   9447    32303   66
• Baselines
– DTM (Blei and Lafferty, 2006)
– Sumblr (Sb.) (Shou et al., 2013)
– DMM (Yin and Wang, 2014)
– MStreamF (Yin et al., 2018): online model (MF-O) and iterative model (MF-G)

Clustering Performance Evaluation

Performance over Stream
[Figure: performance over the stream, one panel per dataset: News, News-T, Reuters, Reuters-T, Tweets]

Parameter Sensitivity
[Figure: three panels showing $\alpha$-sensitivity, $\beta$-sensitivity, and $\lambda$-sensitivity]

Conclusion
• Online semantic-enhanced Dirichlet model (OSDM)
• Properties
– works online instead of batch-wise
– removes term ambiguity using semantic information
– automatically detects the number of topics over time
– maintains evolving clusters over time
• Creates new clusters
• Removes outdated clusters

Thank you!
Acknowledgment: This work is supported by the National Natural Science Foundation of China (61976044), the Fundamental Research Funds for the Central Universities (ZYGX2019Z014), the Fok Ying-Tong Education Foundation for Young Teachers in the Higher Education Institutions of China (161062), and the National Key Research and Development Program (2016YFB0502300).
Any Questions?

References
• David M. Blei and John D. Lafferty. 2006. Dynamic topic models. ACM International Conference Proceeding Series, 148:113–120.
• Lidan Shou, Zhenhua Wang, Ke Chen, and Gang Chen. 2013. Sumblr: Continuous Summarization of Evolving Tweet Streams. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 533–542.
• Jianhua Yin and Jianyong Wang. 2014. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
• Jianhua Yin, Daren Chao, Zhongkun Liu, Wei Zhang, Xiaohui Yu, and Jianyong Wang. 2018. Model-based Clustering of Short Text Streams. In ACM International Conference on Knowledge Discovery and Data Mining, pages 2634–2642.
• Amir Hadifar, Lucas Sterckx, Thomas Demeester, and Chris Develder. 2019. A Self-Training Approach for Short Text Clustering. In Proceedings of the 4th Workshop on Representation Learning for NLP (ACL), pages 194–199.
• Junyang Chen, Zhiguo Gong, and Weiwen Liu. 2019. A nonparametric model for online topic discovery with word embeddings. Information Sciences, 504:32–47.
• Hongyu Gong, Tarek Sakakini, Suma Bhat, and Jinjun Xiong. 2018. Document similarity for texts of varying lengths via hidden topics. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 2341–2351.
• Yukun Zhao, Shangsong Liang, Zhaochun Ren, Jun Ma, Emine Yilmaz, and Maarten de Rijke. 2016. Explainable User Clustering in Short Text Streams. In International ACM Conference on Research and Development in Information Retrieval, pages 155–164.
• Shi Zhong. 2005. Efficient streaming text clustering. Neural Networks, 18(5-6):790–798.

OSDM
[Figure: annotated plate diagram of the OSDM graphical model and its online variant; labels: topic, co-occurrence, unigram / single words, multinomial distribution, Dirichlet process]

Challenges
• Batch-wise vs. Online
– unknown number of documents per time unit
– topics have different life-spans
– the distribution of topics changes over time
[Figure: a document stream]

Challenges
• Batch-wise versus Online
– Batch: divide the stream into fixed-size chunks
– Assumption: no concept drift inside a batch
– documents far apart in time have a weaker relationship
– previous models work batch-wise
[Figure: a stream $stream = \{d_t\}_{t=1}^{\infty}$ divided into fixed-size batches $B_n$, $B_{n+1}$, with temporal dependencies between them]
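The difference between the two processing modes can be shown with a small sketch. The helper names and the `update` callback are hypothetical; the point is only that a batch-wise model changes once per chunk, while an online model changes after every document.

```python
from itertools import islice


def batches(stream, size):
    """Yield consecutive fixed-size chunks B_n, B_{n+1}, ... from a stream."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_batchwise(stream, size, update):
    for chunk in batches(stream, size):
        update(chunk)      # the model only changes once per batch


def process_online(stream, update):
    for doc in stream:
        update([doc])      # the model changes after every document


# Count how often the (hypothetical) model would be updated in each mode.
batch_updates, online_updates = [], []
process_batchwise(range(7), 3, batch_updates.append)
process_online(range(7), online_updates.append)
```

With seven documents and a batch size of three, the batch-wise loop updates three times and the last batch is smaller than the others, whereas the online loop updates seven times.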

Conclusion
• OSDM dynamically assigns each arriving document to an existing cluster or generates a new cluster
• Maintains active topics
• OSDM addresses the term ambiguity problem by using semantic information in the probabilistic graphical model
• Promising, high-quality clustering results

Batch-wise versus Online
[Figure: the number of active topics changes over time: 3 topics at time $n$, 4 at time $n'$, 2 at time $n''$]
$c_i = \{d_1^{c_i}, d_2^{c_i}, \ldots, d_n^{c_i}\}$
$c_i \cap c_j = \emptyset$
