University of Electronic Science & Technology of China
Data Mining Lab
An Online Semantic-enhanced Dirichlet
Model for Short Text Stream Clustering
Jay Kumar, Junming Shao, Salah Ud Din and Wazir Ali
Dated: July 6, 2020
The 58th Annual Meeting of the Association for Computational Linguistics
Outline
• Motivation
• Existing Problems
• Proposed Model
• Experiments
• Conclusion
Motivation
• Short-text data is generated by many online sources
• Applications: news recommendation, hot-topic detection, advertising, opinion mining
• Core task: clustering / topic modeling
Previous Algorithms
• Similarity Based
– HPStream (Aggarwal et al., 2003)
– FW-Kmeans (Jing et al., 2005)
– Den-Stream (Cao et al., 2006)
– SPKM (Zhong et al., 2005)
– ConStream (Aggarwal et al., 2010)
• Require a pre-defined similarity threshold
• Sparsity and scalability issues
• Model Based
– LDA (Blei et al., 2003)
– DTM (Blei et al., 2006)
– TDPM (Ahmed et al., 2008)
– TM-LDA (Wang et al., 2012)
– DPMFP (Huang et al., 2013)
– NPMM (Chen et al., 2019)
– MStream (Yin et al., 2018)
• Batch-wise processing
• Lack of semantic embedding
Challenges
• Semantic Information
– Term ambiguity
• the context of a word changes with its accompanying words
– short documents contain few supporting terms
• Concept Drift
– Topics change over time
– The life-span of topics differs
• Batch vs. Online Processing
– Batch Processing
• Assumption: no concept drift inside a batch
[Figure: word graphs for "Lady Gaga Perform show" and "The wicked Lady show", illustrating the ambiguous term "Lady"]
[Figure: temporal dependencies between consecutive batches $B_n$ and $B_{n+1}$ of the stream, $stream = \{d_t\}_{t=1}^{\infty}$]
Proposed Model
• Propose an Online Semantic-enhanced Dirichlet Model (OSDM)
– Non-parametric probabilistic graphical model
– Automatic detection of topics
– Online clustering
• Maintains active topics online
– Semantic information
Basic Idea:
• A co-occurrence matrix is embedded to remove term ambiguity
• The inverse cluster frequency (ICF) of single terms is used for semantic smoothing
[Figure: plate diagram of the OSDM probabilistic graphical model with nodes $w_i$, $w_j$, $w$, $z$, $\theta$, $\alpha$, $\beta$ and plates $\mathcal{D}$, $\mathcal{N}$, $\infty$]
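OSDM PGM
• Probabilistic Model Formulation
– Prior probability $p(z_d = z \mid \vec{z}, \alpha)$
– Existing cluster: $p(z_d = z \mid \vec{z}, \alpha) = \frac{m_z}{D - 1 + \alpha D}$
– New cluster: $p(z_d = z_{new} \mid \vec{z}, \alpha) = \frac{\alpha D}{D - 1 + \alpha D}$
– $m_z$: number of documents in cluster $z$
– $D$: number of active documents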
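The two prior terms can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes that $D$ counts the arriving document along with the documents already held by active clusters, so that the existing-cluster priors and the new-cluster prior sum to one.

```python
def cluster_prior(m_z: int, num_active_docs: int, alpha: float) -> float:
    """Prior of joining an existing cluster that holds m_z documents."""
    return m_z / (num_active_docs - 1 + alpha * num_active_docs)


def new_cluster_prior(num_active_docs: int, alpha: float) -> float:
    """Prior of opening a new cluster."""
    return (alpha * num_active_docs) / (num_active_docs - 1 + alpha * num_active_docs)


if __name__ == "__main__":
    # Assumed toy setting: two clusters with 4 and 5 documents, plus the arriving one.
    D, alpha = 10, 0.1
    priors = [cluster_prior(4, D, alpha), cluster_prior(5, D, alpha)]
    print(priors, new_cluster_prior(D, alpha))
```

A larger $\alpha$ shifts probability mass toward opening new clusters, which is why it is called the concentration parameter.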
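OSDM PGM
• Probabilistic Model Formulation
– Likelihood measuring the similarity between a cluster and a document: $p(d \mid z_d = z, \vec{d}_{z,\neg d}, \beta)$
– Multinomial distribution
– Existing cluster:
$$p(d \mid z_d = z, \vec{d}_{z,\neg d}, \beta) = \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)}$$
– New cluster:
$$p(d \mid z_d = z_{new}, \beta) = \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( V\beta + i - 1 \right)}$$
– $N_d^w$: occurrences of term $w$ in document $d$; $N_d$: length of $d$; $n_{z,\neg d}^{w}$: occurrences of $w$ in cluster $z$ excluding $d$; $n_{z,\neg d}$: total number of terms in $z$ excluding $d$; $V$: vocabulary size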
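A minimal sketch of the two likelihood terms, evaluated in log space. Working with logarithms is a choice made here to avoid numerical underflow on longer documents; it is not stated on the slides. The function and argument names are hypothetical, and the cluster counts are assumed to already exclude the document being scored.

```python
import math
from collections import Counter


def log_likelihood_existing(doc, cluster_counts, icf, beta, vocab_size):
    """log p(d | z) for an existing cluster.

    doc            -- Counter of term frequencies N_d^w
    cluster_counts -- mapping term -> n_z^w (excluding doc)
    icf            -- mapping term -> inverse cluster frequency
    """
    n_z = sum(cluster_counts.values())
    log_p = 0.0
    for w, n_dw in doc.items():
        weighted = cluster_counts.get(w, 0) * icf.get(w, 1.0)
        for j in range(1, n_dw + 1):
            log_p += math.log(weighted + beta + j - 1)
    for i in range(1, sum(doc.values()) + 1):
        log_p -= math.log(n_z + vocab_size * beta + i - 1)
    return log_p


def log_likelihood_new(doc, beta, vocab_size):
    """log p(d | z_new): the same formula with an empty cluster."""
    return log_likelihood_existing(doc, {}, {}, beta, vocab_size)


if __name__ == "__main__":
    doc = Counter({"a": 2, "b": 1})
    print(log_likelihood_existing(doc, {"a": 3, "c": 1}, {"a": 2.0}, 0.5, 4))
    print(log_likelihood_new(doc, 0.5, 4))
```

Reusing one function for both cases makes the relationship explicit: the new-cluster likelihood is the existing-cluster likelihood with all cluster counts set to zero.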
OSDM PGM
• Probabilistic Model Formulation
– Inverse cluster frequency of a term: $ICF(w \in d) = \log\left(\frac{|Z|}{|w \in Z|}\right)$, where $|Z|$ is the number of active clusters and $|w \in Z|$ is the number of clusters containing $w$
– Co-occurrence weight of a term pair: $cw_{ij} = \frac{\sum_{d' \subseteq z} n_{d'}^{w_i}}{\sum_{d' \subseteq z} n_{d'}^{w_i} + \sum_{d' \subseteq z} n_{d'}^{w_j}}$ s.t. $(w_i, w_j) \in d'$
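The two semantic quantities can be computed as in the sketch below. It assumes that clusters and documents are represented as plain term-frequency dictionaries, and both function names are hypothetical. For each ordered pair of terms that co-occur in a document, the co-occurrence weight sums the frequency of the first term over the documents that contain both.

```python
import math
from itertools import permutations


def inverse_cluster_frequency(word, clusters):
    """ICF(w) = log(|Z| / number of clusters containing w).

    clusters -- list of term-frequency dicts, one per active cluster
    """
    containing = sum(1 for counts in clusters if word in counts)
    if containing == 0:
        raise ValueError(f"{word!r} does not occur in any active cluster")
    return math.log(len(clusters) / containing)


def cooccurrence_weights(docs):
    """cw[(wi, wj)] for every ordered term pair co-occurring in a document.

    docs -- list of term-frequency dicts, the documents of one cluster
    """
    freq = {}  # (wi, wj) -> sum of n_{d'}^{wi} over documents containing both terms
    for doc in docs:
        for wi, wj in permutations(doc, 2):
            freq[(wi, wj)] = freq.get((wi, wj), 0) + doc[wi]
    return {(wi, wj): n / (n + freq[(wj, wi)]) for (wi, wj), n in freq.items()}


if __name__ == "__main__":
    docs = [{"lady": 1, "gaga": 2}, {"lady": 1, "gaga": 1, "show": 1}]
    print(cooccurrence_weights(docs))
```

By construction the weights of a pair are complementary, so `cw[(wi, wj)] + cw[(wj, wi)]` equals 1. A term that occurs in every active cluster gets an ICF of 0.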
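OSDM
• Probability of choosing an existing cluster:
$$p_e(z \to d) = \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)} \cdot \frac{m_z}{D - 1 + \alpha D}$$
• Probability of creating a new cluster:
$$p_n(z_{new} \to d) = \left( \frac{\alpha D}{D - 1 + \alpha D} \right) \cdot \left( \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( V\beta + i - 1 \right)} \right)$$
[Figure: an arriving document $d_t$ is scored against the active topics T1 and T2 with $p_e(T_1 \to d_t)$ and $p_e(T_2 \to d_t)$, and against a new topic with $p_n(T_{new} \to d_t)$]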
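Remove Outdated Clusters
• Exponential decay
– each cluster starts with decay weight $l_z = 1$
– the weight shrinks with elapsed time: $l_z = l_z \times 2^{-\lambda \cdot \Delta time}$
[Figure: the number of active topics changes over time: Topics = 3 at Time = $n$, Topics = 4 at Time = $n'$, Topics = 2 at Time = $n''$]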
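The decay rule is short enough to sketch directly. The pruning threshold below is an assumption: the slides only say that clusters whose weight is approximately zero are deleted, so `threshold` and both function names are illustrative.

```python
def decay_weight(weight: float, decay_rate: float, elapsed: float) -> float:
    """Apply l_z = l_z * 2^(-lambda * delta_time)."""
    return weight * 2.0 ** (-decay_rate * elapsed)


def prune(weights: dict, threshold: float = 1e-4) -> dict:
    """Keep only the clusters whose decay weight is still above the threshold."""
    return {z: l for z, l in weights.items() if l >= threshold}


if __name__ == "__main__":
    # With lambda = 0.5 a weight halves every 1 / lambda = 2 time units.
    print(decay_weight(1.0, 0.5, 2.0))
    print(prune({"a": 0.5, "b": 1e-6}))
```

Because the base is 2, $1/\lambda$ acts as a half-life: a cluster that receives no documents for $1/\lambda$ time units keeps half of its weight.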
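Algorithm: OSDM
Input: $S_t: \{d_t\}_{t=1}^{\infty}$, $\alpha$: concentration parameter, $\beta$: pseudo weight of a term in a cluster, $\lambda$: decay factor
Output: cluster assignments $z_d$
$K = \emptyset$
while $d_t$ in $S_t$ do
    $t = t + 1$
    $K = removeOldZi(K)$ // delete clusters whose $l_z \approx 0$
    $K = reduceClusterWeight(\lambda, K)$
    foreach $z_i \in K$ do
        $P_{z_i} = prob(z_i, d_t)$ using $p_e(z_i \to d_t)$
    end
    $i = \arg\max_i (P_{z_i})$
    $P_{z_n} = p_n(z_{new} \to d_t)$
    if $P_{z_i} < P_{z_n}$ then
        $m_{z_n} = 1$, $l_{z_n} = 1$, $u_{z_n} = t$, $n_{z_n}^{w} = N_{d_t}^{w}$, $cw_{z_n} = cw_{d_t}$, $len_{z_n} = len_{d_t}$
        $K = K \cup z_n$
    else
        $m_{z_i} = m_{z_i} + 1$, $l_{z_i} = 1$, $u_{z_i} = t$, $n_{z_i}^{w} = n_{z_i}^{w} + N_{d_t}^{w}$, $cw_{z_i} = cw_{z_i} \cup cw_{d_t}$, $len_{z_i} = len_{z_i} + len_{d_t}$
    end
end
where
$$p_e(z \to d) = \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)} \cdot \frac{m_z}{D - 1 + \alpha D}$$
$$p_n(z_{new} \to d) = \left( \frac{\alpha D}{D - 1 + \alpha D} \right) \cdot \left( \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( V\beta + i - 1 \right)} \right)$$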
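The loop above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation. It makes several assumptions: one time step per arriving document, $D$ counting the arriving document, a hypothetical `min_weight` pruning threshold, scores compared in log space, and no bookkeeping for the co-occurrence matrix $cw$ or the cluster length. All names and default values are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    m: int = 0                                        # m_z: documents in the cluster
    weight: float = 1.0                               # l_z: decay weight
    updated: int = 0                                  # u_z: time of last update
    counts: Counter = field(default_factory=Counter)  # n_z^w: term frequencies


def log_score(doc, counts, icf, prior, beta, vocab_size):
    """log(prior * likelihood); empty counts yield the new-cluster score."""
    score = math.log(prior)
    for w, n_dw in doc.items():
        base = counts.get(w, 0) * icf.get(w, 0.0) + beta
        score += sum(math.log(base + j) for j in range(n_dw))
    denom = sum(counts.values()) + vocab_size * beta
    score -= sum(math.log(denom + i) for i in range(sum(doc.values())))
    return score


def osdm_stream(stream, alpha=0.1, beta=0.01, decay=1e-3, min_weight=1e-4):
    """Assign each tokenized document of the stream to a cluster id."""
    clusters, assignments, next_id = {}, [], 0
    for t, tokens in enumerate(stream, start=1):
        doc = Counter(tokens)
        # Drop faded clusters, then decay the survivors by one time step.
        clusters = {k: c for k, c in clusters.items() if c.weight >= min_weight}
        for c in clusters.values():
            c.weight *= 2.0 ** (-decay)
        # Statistics over the active clusters; D includes the arriving document.
        D = sum(c.m for c in clusters.values()) + 1
        V = len(set(doc).union(*(c.counts for c in clusters.values())))
        df = Counter(w for c in clusters.values() for w in c.counts)
        icf = {w: math.log(len(clusters) / n) for w, n in df.items()}
        norm = D - 1 + alpha * D
        best_id, best = None, -math.inf
        for k, c in clusters.items():
            s = log_score(doc, c.counts, icf, c.m / norm, beta, V)
            if s > best:
                best_id, best = k, s
        new = log_score(doc, {}, {}, alpha * D / norm, beta, V)
        if best < new:  # the new cluster wins: open it
            best_id, next_id = next_id, next_id + 1
            clusters[best_id] = Cluster()
        c = clusters[best_id]
        c.m, c.weight, c.updated = c.m + 1, 1.0, t
        c.counts.update(doc)
        assignments.append(best_id)
    return assignments


if __name__ == "__main__":
    stream = [["lady", "gaga", "show"], ["stock", "market"], ["lady", "gaga", "show"]]
    print(osdm_stream(stream))
```

Each document is processed once and only the active clusters are kept in memory, which is what makes the procedure online rather than batch-wise.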
Complexity
• Space Complexity
– $O(KV + V^2 + VD)$
• $K$: average number of active topics/clusters
• $D$: number of active documents
• $V$: size of the active vocabulary
• Each cluster stores $V$ terms on average
• The co-occurrence matrix of a cluster has at most $V \cdot V$ entries
• Time Complexity
– $O(K(\mathcal{L}V))$
• $\mathcal{L}$: average length of an arriving document
Experiments and Results
Experimentation
• Datasets
Dataset     Docs    Vocab   Topics
News        11109   8110    152
News-T      11109   8110    152
Tweets      30322   12301   269
Reuters     9447    32303   66
Reuters-T   9447    32303   66
• Baselines
– DTM (Blei and Lafferty, 2006)
– Sumblr (Sb.) (Shou et al., 2013)
– DMM (Yin and Wang, 2014)
– MStreamF (Yin et al., 2018): online model (MF-O) and iterative model (MF-G)
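Clustering Performance Evaluation
Performance over Stream
[Figure: performance over the stream, one panel per dataset: News, News-T, Reuters, Reuters-T, Tweets]
Parameter Sensitivity
[Figure: three panels showing $\alpha$-sensitivity, $\beta$-sensitivity and $\lambda$-sensitivity]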
Conclusion
• Online Semantic-enhanced Dirichlet Model (OSDM)
• Properties
– works online instead of batch-wise
– removes term ambiguity with semantic information
– automatically detects the number of topics over time
– maintains evolving clusters over time
• Creates new clusters
• Removes outdated clusters
Thank you!
Acknowledgment: This work is supported by the National Natural Science Foundation of China (61976044), the Fundamental Research Funds for the Central Universities (ZYGX2019Z014), the Fok Ying-Tong Education Foundation for Young Teachers in the Higher Education Institutions of China (161062), and the National Key Research and Development Program (2016YFB0502300).
Any Questions?
References
• David M. Blei and John D. Lafferty. 2006. Dynamic topic models. ACM International Conference Proceeding Series, 148:113–120.
• Lidan Shou, Zhenhua Wang, Ke Chen, and Gang Chen. 2013. Sumblr: Continuous Summarization of Evolving Tweet Streams. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 533–542.
• Jianhua Yin and Jianyong Wang. 2014. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
• Jianhua Yin, Daren Chao, Zhongkun Liu, Wei Zhang, Xiaohui Yu, and Jianyong Wang. 2018. Model-based Clustering of Short Text Streams. In ACM International Conference on Knowledge Discovery and Data Mining, pages 2634–2642.
• Amir Hadifar, Lucas Sterckx, Thomas Demeester, and Chris Develder. 2019. A Self-Training Approach for Short Text Clustering. In Proceedings of the 4th Workshop on Representation Learning for NLP (ACL), pages 194–199.
• Junyang Chen, Zhiguo Gong, and Weiwen Liu. 2019. A nonparametric model for online topic discovery with word embeddings. Information Sciences, 504:32–47.
• Hongyu Gong, Tarek Sakakini, Suma Bhat, and Jinjun Xiong. 2018. Document similarity for texts of varying lengths via hidden topics. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 2341–2351.
• Yukun Zhao, Shangsong Liang, Zhaochun Ren, Jun Ma, Emine Yilmaz, and Maarten de Rijke. 2016. Explainable User Clustering in Short Text Streams. In International ACM Conference on Research and Development in Information Retrieval, pages 155–164.
• Shi Zhong. 2005. Efficient streaming text clustering. Neural Networks, 18(5-6):790–798.
OSDM
[Figure: plate diagram of the online OSDM model, annotated with its components: topic, co-occurrence, unigram / single words, Multinomial distribution, Dirichlet process]
Challenges
• Batch-wise vs. Online
– unknown number of documents per time unit
– different life-spans of topics
– the distribution of topics changes over time
[Figure: documents arriving as a stream]
Challenges
• Batch-wise versus Online
– Batch: divide a stream into fixed-size chunks
– Assumption: no concept drift inside a batch
– documents far apart in time have a weaker relationship
– previous models work batch-wise
[Figure: temporal dependencies between fixed-size batches $B_n$ and $B_{n+1}$ of the stream, $stream = \{d_t\}_{t=1}^{\infty}$]
Conclusion
• OSDM dynamically assigns each arriving document to an existing cluster or generates a new cluster
• Maintains active topics
• OSDM addresses the term ambiguity problem using semantic information in the probabilistic graphical model
• Promising, high-quality clustering results
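Batch-wise versus Online
[Figure: the number of active topics changes over time: Topics = 3 at Time = $n$, Topics = 4 at Time = $n'$, Topics = 2 at Time = $n''$]
$c_i = \{ d_1^{c_i}, d_2^{c_i}, \ldots, d_n^{c_i} \}$
$c_i \cap c_j = \emptyset$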

More Related Content

What's hot

What makes a linked data pattern interesting?
What makes a linked data pattern interesting?What makes a linked data pattern interesting?
What makes a linked data pattern interesting?
Szymon Klarman
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
Polytechnic University of Bari
 
Ibm cognitive seminar march 2015 watsonsim final
Ibm cognitive seminar march 2015  watsonsim finalIbm cognitive seminar march 2015  watsonsim final
Ibm cognitive seminar march 2015 watsonsim final
diannepatricia
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
Arnab Bhadury
 
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
AttSum: Joint Learning of Focusing and Summarization with Neural AttentionAttSum: Joint Learning of Focusing and Summarization with Neural Attention
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
Kodaira Tomonori
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Polytechnic University of Bari
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
DataTactics
 
Neural Summarization by Extracting Sentences and Words
Neural Summarization by Extracting Sentences and WordsNeural Summarization by Extracting Sentences and Words
Neural Summarization by Extracting Sentences and Words
Kodaira Tomonori
 
Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...
inscit2006
 
Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...
Bhaskar Mitra
 
2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining
jins0618
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
Elsevier
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
Dustin Smith
 
20130622 okfn hackathon t2
20130622 okfn hackathon t220130622 okfn hackathon t2
20130622 okfn hackathon t2
Seonho Kim
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
mustafa sarac
 

What's hot (15)

What makes a linked data pattern interesting?
What makes a linked data pattern interesting?What makes a linked data pattern interesting?
What makes a linked data pattern interesting?
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
Ibm cognitive seminar march 2015 watsonsim final
Ibm cognitive seminar march 2015  watsonsim finalIbm cognitive seminar march 2015  watsonsim final
Ibm cognitive seminar march 2015 watsonsim final
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
 
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
AttSum: Joint Learning of Focusing and Summarization with Neural AttentionAttSum: Joint Learning of Focusing and Summarization with Neural Attention
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Neural Summarization by Extracting Sentences and Words
Neural Summarization by Extracting Sentences and WordsNeural Summarization by Extracting Sentences and Words
Neural Summarization by Extracting Sentences and Words
 
Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...
 
Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...
 
2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
20130622 okfn hackathon t2
20130622 okfn hackathon t220130622 okfn hackathon t2
20130622 okfn hackathon t2
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
 

Similar to An online semantic enhanced dirichlet model for short text

Model of semantic textual document clustering
Model of semantic textual document clusteringModel of semantic textual document clustering
Model of semantic textual document clustering
SK Ahammad Fahad
 
Writing a scientific manuscript
Writing a scientific manuscriptWriting a scientific manuscript
Writing a scientific manuscript
Martin McMorrow
 
Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...
BaoTramDuong2
 
Chris Dyer - 2017 - Neural MT Workshop Invited Talk: The Neural Noisy Channel...
Chris Dyer - 2017 - Neural MT Workshop Invited Talk: The Neural Noisy Channel...Chris Dyer - 2017 - Neural MT Workshop Invited Talk: The Neural Noisy Channel...
Chris Dyer - 2017 - Neural MT Workshop Invited Talk: The Neural Noisy Channel...
Association for Computational Linguistics
 
Big Sky Earth 2018 Introduction to machine learning
Big Sky Earth 2018 Introduction to machine learningBig Sky Earth 2018 Introduction to machine learning
Big Sky Earth 2018 Introduction to machine learning
Julien TREGUER
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
Ian Foster
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
Paul Groth
 
Temporal profiles of avalanches on networks
Temporal profiles of avalanches on networksTemporal profiles of avalanches on networks
Temporal profiles of avalanches on networks
James Gleeson
 
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
Waqas Nawaz
 
Discovering emerging effects in Learning Networks with simulations Hendrik Dr...
Discovering emerging effects in Learning Networks with simulations Hendrik Dr...Discovering emerging effects in Learning Networks with simulations Hendrik Dr...
Discovering emerging effects in Learning Networks with simulations Hendrik Dr...
Hendrik Drachsler
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
Angelo Salatino
 
Ngsp
NgspNgsp
Ngsp
Tim Clark
 
Effects of Network Structure, Competition and Memory Time on Social Spreading...
Effects of Network Structure, Competition and Memory Time on Social Spreading...Effects of Network Structure, Competition and Memory Time on Social Spreading...
Effects of Network Structure, Competition and Memory Time on Social Spreading...
James Gleeson
 
Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics
Angelo Salatino
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
Ian Foster
 
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...
GigaScience, BGI Hong Kong
 
Ethics reproducibility and data stewardship
Ethics reproducibility and data stewardshipEthics reproducibility and data stewardship
Ethics reproducibility and data stewardship
Russell Jarvis
 
Search, Discovery and Analysis of Sensory Data Streams
Search, Discovery and Analysis of Sensory Data StreamsSearch, Discovery and Analysis of Sensory Data Streams
Search, Discovery and Analysis of Sensory Data Streams
PayamBarnaghi
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
Joaquin Vanschoren
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
Ian Foster
 

Similar to An online semantic enhanced dirichlet model for short text (20)

Model of semantic textual document clustering
Model of semantic textual document clusteringModel of semantic textual document clustering
Model of semantic textual document clustering
 
Writing a scientific manuscript
Writing a scientific manuscriptWriting a scientific manuscript
Writing a scientific manuscript
 
Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...
 
Chris Dyer - 2017 - Neural MT Workshop Invited Talk: The Neural Noisy Channel...
Chris Dyer - 2017 - Neural MT Workshop Invited Talk: The Neural Noisy Channel...Chris Dyer - 2017 - Neural MT Workshop Invited Talk: The Neural Noisy Channel...
Chris Dyer - 2017 - Neural MT Workshop Invited Talk: The Neural Noisy Channel...
 
Big Sky Earth 2018 Introduction to machine learning
Big Sky Earth 2018 Introduction to machine learningBig Sky Earth 2018 Introduction to machine learning
Big Sky Earth 2018 Introduction to machine learning
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
Temporal profiles of avalanches on networks
Temporal profiles of avalanches on networksTemporal profiles of avalanches on networks
Temporal profiles of avalanches on networks
 
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
 
Discovering emerging effects in Learning Networks with simulations Hendrik Dr...
Discovering emerging effects in Learning Networks with simulations Hendrik Dr...Discovering emerging effects in Learning Networks with simulations Hendrik Dr...
Discovering emerging effects in Learning Networks with simulations Hendrik Dr...
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
 
Ngsp
NgspNgsp
Ngsp
 
Effects of Network Structure, Competition and Memory Time on Social Spreading...
Effects of Network Structure, Competition and Memory Time on Social Spreading...Effects of Network Structure, Competition and Memory Time on Social Spreading...
Effects of Network Structure, Competition and Memory Time on Social Spreading...
 
Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...
 
Ethics reproducibility and data stewardship
Ethics reproducibility and data stewardshipEthics reproducibility and data stewardship
Ethics reproducibility and data stewardship
 
Search, Discovery and Analysis of Sensory Data Streams
Search, Discovery and Analysis of Sensory Data StreamsSearch, Discovery and Analysis of Sensory Data Streams
Search, Discovery and Analysis of Sensory Data Streams
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 

Recently uploaded

一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
nhero3888
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
exukyp
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
Rebecca Bilbro
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
Timothy Spann
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 

Recently uploaded (20)

一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 

An online semantic enhanced dirichlet model for short text

  • 1. University of Electronic Science & Technology of China 数据挖掘实验室 电 子 科 技 大 学 Data Mining Lab An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering Jay Kumar, Junming Shao, Salah Ud Din and Wazir Ali Dated: July 6, 2020 The 58th Annual Meeting of the Association for Computational Linguistics
  • 2. Outline
      • Motivation
      • Existing Problems
      • Proposed Model
      • Experimentations
      • Conclusion
  • 3. Motivation
      • Short-text data is generated by many online sources
      [Figure: online sources feeding Clustering / Topic Modeling]
  • 4. Motivation
      • Applications: News Recommendation, Hot-topic Detection, Advertising, Opinion Mining
      [Figure: Clustering / Topic Modeling feeding these applications]
  • 5. Previous Algorithms
      • Similarity based
          – HPStream (Aggarwal et al., 2003)
          – FW-Kmeans (Jing et al., 2005)
          – Den-Stream (Cao et al., 2006)
          – SPKM (Zhong et al., 2005)
          – ConStream (Aggarwal et al., 2010)
          – Drawbacks: require a pre-defined similarity threshold; sparsity and scalability issues
      • Model based
          – LDA (Blei et al., 2003)
          – DTM (Blei et al., 2006)
          – TDPM (Ahmed et al., 2008)
          – TM-LDA (Wang et al., 2012)
          – DPMFP (Huang et al., 2013)
          – NPMM (Chen et al., 2019)
          – MStream (Yin et al., 2018)
          – Drawbacks: batch-wise processing; lack of semantic embedding
  • 6. Challenges
      • Semantic information
          – Term ambiguity: the context of a word changes with the words that accompany it (example: "Lady Gaga perform show" versus "The wicked lady show")
          – A short document contains few supporting terms
      • Concept drift
          – Topics change over time
          – The life-span of topics differs
      • Batch versus online processing
          – Batch processing assumes no concept drift inside a batch
      [Figure: $stream = \{d_t\}_{t=1}^{\infty}$ divided into batches $B_n$, $B_{n+1}$, with temporal dependencies between them]
  • 7. Proposed Model
      • Propose an online semantic-enhanced Dirichlet Model (OSDM)
          – Non-parametric probabilistic graphical model
          – Automatic detection of topics
          – Online clustering: maintains active topics online
          – Semantic information
      • Basic idea
          – A co-occurrence matrix is embedded to remove term ambiguity
          – The inverse cluster frequency of each single term is used for semantic smoothing
      [Figure: two plate diagrams of the graphical model with nodes α, θ, z, w, w_i, w_j, β and plates N, D, ∞; the second is labeled "online"]
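  • 8. OSDM PGM
      • Probabilistic model formulation
          – Prior probability: $p(z_d = z \mid \vec{z}, \alpha)$
          – Existing cluster: $\frac{m_z}{D - 1 + \alpha D}$
          – New cluster: $\frac{\alpha D}{D - 1 + \alpha D}$
          – $m_z$: number of documents in cluster $z$; $D$: number of active documents
      [Figure: plate diagram with nodes α, θ, z, w, w_i, w_j, β and plates N, D, ∞]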
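The two priors on this slide can be checked numerically. The sketch below is only an illustration: the function name `cluster_priors` is hypothetical, and it assumes that $D$ counts the arriving document together with the documents already held in active clusters, so that the existing-cluster priors and the new-cluster prior sum to one.

```python
def cluster_priors(cluster_sizes, alpha):
    """Return (existing_priors, new_prior) for one arriving document.

    cluster_sizes: list of m_z, the number of documents in each active cluster.
    Assumption: D = sum(m_z) + 1, i.e. the arriving document is counted in D.
    """
    D = sum(cluster_sizes) + 1
    denom = D - 1 + alpha * D
    existing = [m_z / denom for m_z in cluster_sizes]  # m_z / (D - 1 + alpha*D)
    new = alpha * D / denom                            # alpha*D / (D - 1 + alpha*D)
    return existing, new


existing, new = cluster_priors([3, 1], alpha=0.5)
print(existing, new)
```

Larger clusters receive a larger prior, while α controls how readily a new cluster is opened.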
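  • 9. OSDM PGM
      • Probabilistic model formulation
          – Probability used as the similarity between a cluster and a document: $p(d \mid z_d = z, d_{z \to d}, \beta)$
          – Multinomial distribution
          – Existing cluster: $\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)}$
          – New cluster: $\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( V\beta + i - 1 \right)}$
      [Figure: plate diagram with nodes α, θ, z, w, w_i, w_j, β and plates N, D, ∞]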
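  • 10. OSDM PGM
      • Probabilistic model formulation
          – Probability used as the similarity between a cluster and a document: $p(d \mid z_d = z, d_{z \to d}, \beta)$
          – Multinomial distribution
          – Existing cluster: $\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)}$
          – New cluster: $\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( V\beta + i - 1 \right)}$
          – Inverse cluster frequency: $ICF(w \in d) = \log\left( \frac{|Z|}{|w \in Z|} \right)$
          – Co-occurrence weight: $cw_{ij} = \frac{\sum_{d' \subseteq z} n_{d'}^{w_i}}{\sum_{d' \subseteq z} n_{d'}^{w_i} + \sum_{d' \subseteq z} n_{d'}^{w_j}}$ s.t. $(w_i, w_j) \in d'$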
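As a rough illustration of the two semantic quantities on this slide, the sketch below computes ICF and the co-occurrence weight from plain word counts. The function names and the data layout (one `Counter` per cluster or per document) are assumptions made for the example, as are the zero fallbacks for words or pairs that never occur.

```python
import math
from collections import Counter


def icf(word, clusters):
    """ICF(w) = log(|Z| / number of clusters that contain w).

    clusters: list of Counter objects, one word-count table per active cluster.
    Assumption: a word found in no cluster gets ICF 0.
    """
    containing = sum(1 for counts in clusters if counts.get(word, 0) > 0)
    if containing == 0:
        return 0.0
    return math.log(len(clusters) / containing)


def cooccurrence_weight(docs, w_i, w_j):
    """cw_ij over the documents of one cluster in which w_i and w_j co-occur.

    docs: list of Counter objects, one per document in the cluster.
    Assumption: a pair that never co-occurs gets weight 0.
    """
    n_i = sum(d[w_i] for d in docs if w_i in d and w_j in d)
    n_j = sum(d[w_j] for d in docs if w_i in d and w_j in d)
    return n_i / (n_i + n_j) if (n_i + n_j) else 0.0


clusters = [Counter(lady=2, gaga=1), Counter(lady=1, wicked=1), Counter(show=3)]
docs = [Counter(lady=2, gaga=1), Counter(lady=1, gaga=1), Counter(lady=5)]
print(icf("lady", clusters), icf("show", clusters))
print(cooccurrence_weight(docs, "lady", "gaga"))
```

A word that appears in few clusters gets a high ICF, so it contributes more to the cluster score than a word spread across many clusters.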
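  • 11. OSDM
      • Probability of choosing an existing cluster: $p_e(z \to d) = \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)} \cdot \frac{m_z}{D - 1 + \alpha D}$
      • Probability of creating a new cluster: $p_n(z_{new} \to d) = \left( \frac{\alpha D}{D - 1 + \alpha D} \right) \cdot \left( \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( V\beta + i - 1 \right)} \right)$
      [Figure: arriving document $d_t$ scored against active topics T1 and T2 with $p_e(T_1 \to d_t)$ and $p_e(T_2 \to d_t)$, and against a new topic with $p_n(T_{new} \to d_t)$]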
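The products in these two probabilities become very small for longer documents, so one common way to evaluate them is in log space. The sketch below does that for the formulas as they appear on the slide; the function names, argument layout, and the default ICF of 1.0 for words without an entry are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def log_p_existing(doc, n_z, len_z, m_z, D, V, alpha, beta, icf=None):
    """log p_e(z -> d): prior m_z / (D - 1 + alpha*D) times the ICF-weighted likelihood.

    doc: Counter of term counts N_d^w; n_z: Counter of term counts in cluster z;
    len_z: total number of terms in cluster z; icf: optional dict word -> ICF_w
    (assumption: words missing from icf are weighted by 1.0).
    """
    icf = icf or {}
    log_p = math.log(m_z / (D - 1 + alpha * D))
    for w, n_dw in doc.items():
        weighted = n_z.get(w, 0) * icf.get(w, 1.0)
        for j in range(1, n_dw + 1):
            log_p += math.log(weighted + beta + j - 1)
    for i in range(1, sum(doc.values()) + 1):
        log_p -= math.log(len_z + V * beta + i - 1)
    return log_p


def log_p_new(doc, D, V, alpha, beta):
    """log p_n(z_new -> d): prior alpha*D / (D - 1 + alpha*D) times the empty-cluster likelihood."""
    log_p = math.log(alpha * D / (D - 1 + alpha * D))
    for n_dw in doc.values():
        for j in range(1, n_dw + 1):
            log_p += math.log(beta + j - 1)
    for i in range(1, sum(doc.values()) + 1):
        log_p -= math.log(V * beta + i - 1)
    return log_p


doc = Counter(lady=1, show=1)
cluster = Counter(lady=3, show=2)
print(log_p_existing(doc, cluster, 5, m_z=2, D=3, V=4, alpha=0.1, beta=0.01))
print(log_p_new(doc, D=3, V=4, alpha=0.1, beta=0.01))
```

Comparing the two log values is equivalent to comparing the probabilities themselves, since the logarithm is monotonic.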
  • 12. Remove Outdated Clusters
      • Exponential decay: each cluster has an initial decay score of 1
      • $l_z = l_z \times 2^{-\lambda (\Delta time)}$
      [Figure: number of active topics changing over time: Topics = 3 at Time = $n$, Topics = 4 at Time = $n'$, Topics = 2 at Time = $n''$]
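The decay rule is short enough to try directly. In the sketch below the function names and the pruning threshold `eps` are hypothetical; the slide only says that clusters whose score has decayed toward zero are removed.

```python
def decay_score(l_z, lam, delta_time):
    """Apply l_z = l_z * 2^(-lambda * delta_time)."""
    return l_z * 2 ** (-lam * delta_time)


def is_outdated(l_z, eps=1e-3):
    """Assumption: a cluster counts as outdated once its score falls below eps."""
    return l_z < eps


score = 1.0  # every cluster starts with decay score 1
for step in range(1, 4):
    score = decay_score(score, lam=1.0, delta_time=1)
    print(step, score, is_outdated(score))
```

With this form the score halves every $1/\lambda$ time units, so a larger λ forgets idle clusters faster.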
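  • 13. Algorithm: OSDM
      Input: $S_t = \{d_t\}_{t=1}^{\infty}$; $\alpha$: concentration parameter; $\beta$: pseudo weight of a term in a cluster; $\lambda$: decay factor
      Output: cluster assignments $z_d$
      $K = \emptyset$
      while $d_t$ in $S_t$ do
          $t = t + 1$
          $K = removeOldZi(K)$ // delete clusters whose $l_z \approx 0$
          $K = reduceClusterWeight(\lambda, K)$
          foreach $z_i \in K$ do
              $P_{z_i} = prob(z_i, d_t)$ using the existing-cluster probability
          end
          $i = \arg\max_i (P_{z_i})$
          $P_{z_n}$ = calculate the new-cluster probability
          if $P_{z_i} < P_{z_n}$ then
              $m_{z_n} = 1$, $l_{z_n} = 1$, $u_{z_n} = t$, $n_{z_n}^{w} = N_{d_t}^{w}$, $cw_{z_n} = cw_{d_t}$, $len_{z_n} = len_{d_t}$
              $K = K \cup z_n$
          else
              $m_{z_i} = m_{z_i} + 1$, $l_{z_i} = 1$, $u_{z_i} = t$, $n_{z_i}^{w} = n_{z_i}^{w} + N_{d_t}^{w}$, $cw_{z_i} = cw_{z_i} \cup cw_{d_t}$, $len_{z_i} = len_{z_i} + len_{d_t}$
          end
      end
      Existing-cluster probability: $\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{z,\neg d}^{w} \cdot ICF_w + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)} \cdot \frac{m_z}{D - 1 + \alpha D}$
      New-cluster probability: $\left( \frac{\alpha D}{D - 1 + \alpha D} \right) \cdot \left( \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( V\beta + i - 1 \right)} \right)$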
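The loop on this slide can be sketched as a small one-pass clusterer. This is a simplified illustration under stated assumptions rather than the authors' code: the ICF weighting and the co-occurrence matrix are left out (every term is weighted by 1), each arriving document counts as one time unit for the decay, `V` is the vocabulary seen so far, `D` is the number of clustered documents plus the arriving one, and all names (`Cluster`, `osdm_stream`, `eps`) are hypothetical.

```python
import math
from collections import Counter


class Cluster:
    def __init__(self, doc, t):
        self.m, self.l, self.u = 1, 1.0, t            # doc count, decay score, last update
        self.n, self.len = Counter(doc), sum(doc.values())

    def add(self, doc, t):
        self.m, self.l, self.u = self.m + 1, 1.0, t   # an update resets the decay score
        self.n.update(doc)
        self.len += sum(doc.values())


def log_score(doc, n_z, len_z, prior, V, beta):
    """log(prior * likelihood); ICF is taken as 1 in this simplified sketch."""
    lp = math.log(prior)
    for w, n_dw in doc.items():
        for j in range(n_dw):
            lp += math.log(n_z.get(w, 0) + beta + j)
    for i in range(sum(doc.values())):
        lp -= math.log(len_z + V * beta + i)
    return lp


def osdm_stream(stream, alpha=0.1, beta=0.01, lam=0.01, eps=1e-3):
    clusters, vocab, labels, next_id = {}, set(), [], 0
    for t, tokens in enumerate(stream, start=1):
        doc = Counter(tokens)
        vocab.update(doc)
        # Remove outdated clusters, then decay the survivors (one time unit per document).
        clusters = {k: c for k, c in clusters.items() if c.l >= eps}
        for c in clusters.values():
            c.l *= 2 ** (-lam)
        D = sum(c.m for c in clusters.values()) + 1
        V, denom = len(vocab), D - 1 + alpha * D
        scores = {k: log_score(doc, c.n, c.len, c.m / denom, V, beta)
                  for k, c in clusters.items()}
        new_score = log_score(doc, {}, 0, alpha * D / denom, V, beta)
        best = max(scores, key=scores.get) if scores else None
        if best is None or scores[best] < new_score:
            best, next_id = next_id, next_id + 1
            clusters[best] = Cluster(doc, t)
        else:
            clusters[best].add(doc, t)
        labels.append(best)
    return labels


stream = [["lady", "gaga", "show"], ["lady", "gaga", "concert"],
          ["stock", "market", "crash"], ["stock", "market", "rally"]]
print(osdm_stream(stream))
```

On this toy stream the first two documents share a cluster and the two finance documents open and then join a second one, because shared terms raise the existing-cluster score well above the new-cluster score.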
  • 14. Complexity
      • Space complexity: $O(KV + V^2 + VD)$
          – $K$: average number of active topics/clusters
          – $D$: number of active documents
          – $V$: size of the active vocabulary
          – Each cluster stores on average $V$ vocabulary terms
          – The co-occurrence matrix in a cluster is at most $V \times V$
      • Time complexity: $O(K(\mathcal{L}V))$
          – $\mathcal{L}$: average length of each arriving document
  • 16. Experimentation
      • Datasets

          Dataset      Docs      Vocab     Topics
          News         11,109    8,110     152
          News-T       11,109    8,110     152
          Tweets       30,322    12,301    269
          Reuters      9,447     32,303    66
          Reuters-T    9,447     32,303    66

      • Baselines
          – DTM (Blei and Lafferty, 2006)
          – Sumblr (Sb.) (Shou et al., 2013)
          – DMM (Yin and Wang, 2014)
          – MStreamF (Yin et al., 2018): online model (MF-O) and iterative model (MF-G)
  • 18. Performance over Stream
      [Figure panels: News, News-T, Reuters, Reuters-T, Tweets]
  • 19. Parameter Sensitivity
      [Figure panels: $\alpha$-sensitivity, $\beta$-sensitivity, $\lambda$-sensitivity]
  • 20. Conclusion
      • Online semantic-enhanced Dirichlet Model
      • Properties
          – Works online instead of batch-wise
          – Removes term ambiguity using semantic information
          – Automatically detects the number of topics over time
          – Maintains evolving clusters over time: creates new clusters and removes outdated clusters
  • 21. Thank you! Any questions?
      Acknowledgment: This work is supported by the National Natural Science Foundation of China (61976044), Fundamental Research Funds for the Central Universities (ZYGX2019Z014), Fok Ying-Tong Education Foundation for Young Teachers in the Higher Education Institutions of China (161062), and the National Key Research and Development Program (2016YFB0502300).
  • 22. References
      • David M. Blei and John D. Lafferty. 2006. Dynamic topic models. ACM International Conference Proceeding Series, 148:113–120.
      • Lidan Shou, Zhenhua Wang, Ke Chen, and Gang Chen. 2013. Sumblr: Continuous summarization of evolving tweet streams. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 533–542.
      • Jianhua Yin and Jianyong Wang. 2014. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
      • Jianhua Yin, Daren Chao, Zhongkun Liu, Wei Zhang, Xiaohui Yu, and Jianyong Wang. 2018. Model-based clustering of short text streams. In ACM International Conference on Knowledge Discovery and Data Mining, pages 2634–2642.
      • Amir Hadifar, Lucas Sterckx, Thomas Demeester, and Chris Develder. 2019. A self-training approach for short text clustering. In Proceedings of the 4th Workshop on Representation Learning for NLP (ACL), pages 194–199.
      • Junyang Chen, Zhiguo Gong, and Weiwen Liu. 2019. A nonparametric model for online topic discovery with word embeddings. Information Sciences, 504:32–47.
      • Hongyu Gong, Tarek Sakakini, Suma Bhat, and Jinjun Xiong. 2018. Document similarity for texts of varying lengths via hidden topics. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 2341–2351.
      • Yukun Zhao, Shangsong Liang, Zhaochun Ren, Jun Ma, Emine Yilmaz, and Maarten de Rijke. 2016. Explainable user clustering in short text streams. In International ACM Conference on Research and Development in Information Retrieval, pages 155–164.
      • Shi Zhong. 2005. Efficient streaming text clustering. Neural Networks, 18(5-6):790–798.
  • 24. Challenges
      • Batch-wise versus online
          – Unknown number of documents per time unit
          – Topics have different life-spans
          – The distribution of topics changes over time
      [Figure: a continuous document stream]
  • 25. Challenges
      • Batch-wise versus online
          – Batch: divide a stream into fixed-size chunks
          – Assumption: no concept drift inside a batch
          – Documents far apart in time have a weaker relationship
          – Previous models work batch-wise
      [Figure: $stream = \{d_t\}_{t=1}^{\infty}$ divided into fixed-size batches $B_n$, $B_{n+1}$, with temporal dependencies between them]
  • 26. Conclusion
      • OSDM dynamically assigns each arriving document to an existing cluster or generates a new cluster
      • It maintains the active topics
      • OSDM addresses the term ambiguity problem by using semantic information in the probabilistic graphical model
      • Promising, high-quality clustering results
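  • 27. Batch-wise versus Online
      • $c_i = \{ d_1^{c_i}, d_2^{c_i}, \ldots, d_n^{c_i} \}$, with $c_i \cap c_j = \emptyset$
      [Figure: number of active topics changing over time: Topics = 3 at Time = $n$, Topics = 4 at Time = $n'$, Topics = 2 at Time = $n''$]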