Dynamic Multi-Faceted Topic Discovery in Twitter

Jan Vosecky
Di Jiang
Kenneth Wai-Ting Leung
Wilfred Ng

Discovering high-level topics from social streams is important for many downstream applications. However, traditional text mining methods that rely on the bag-of-words model are insufficient to uncover the rich semantics and temporal aspects of topics in Twitter. In particular, topics in Twitter are inherently dynamic and often focus on specific entities, such as people or organizations. In this paper, we therefore propose a method for mining multi-faceted topics from Twitter streams. The Multi-Faceted Topic Model (MfTM) jointly models latent semantics among terms and entities and captures the temporal characteristics of each topic. We develop an efficient online inference method for MfTM, which enables our model to be applied to large-scale and streaming data. Our experimental evaluation shows the effectiveness and efficiency of our model compared with state-of-the-art baselines. We further demonstrate the effectiveness of our framework in the context of tweet clustering.
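
To make the idea of jointly modelling terms, entities and time more concrete, the following is a minimal sketch of a generative process in the spirit of the model outlined in the slides: each topic holds one multinomial per facet, and each tweet is a mixture over topics. It is not the authors' implementation. The facet sizes, the Dirichlet priors, the per-element topic draw (borrowed from LDA) and the discretization of timestamps into bins are all assumptions made for illustration, and `generate_tweet` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
K = 3  # number of latent topics
FACET_SIZES = {
    "terms": 50,
    "persons": 10,
    "locations": 8,
    "organizations": 6,
    "time": 12,  # assumption: timestamps discretized into 12 bins
}

# One multinomial per (topic, facet). The Dirichlet prior is an assumption.
# phi[facet] has shape (K, facet vocabulary size); each row sums to 1.
phi = {
    facet: rng.dirichlet(np.full(size, 0.1), size=K)
    for facet, size in FACET_SIZES.items()
}


def generate_tweet(counts, alpha=0.5):
    """Sample one synthetic tweet.

    counts maps a facet name to the number of elements to draw from it.
    Returns the tweet's topic mixture and the sampled element ids per facet.
    """
    theta = rng.dirichlet(np.full(K, alpha))  # tweet-level topic distribution
    tweet = {}
    for facet, n in counts.items():
        # Assumption (as in LDA): every element gets its own topic z.
        z = rng.choice(K, size=n, p=theta)
        tweet[facet] = [
            int(rng.choice(FACET_SIZES[facet], p=phi[facet][topic]))
            for topic in z
        ]
    return theta, tweet


theta, tweet = generate_tweet(
    {"terms": 6, "persons": 1, "locations": 1, "organizations": 1, "time": 1}
)
```

Keeping a separate distribution per facet is what lets a single topic surface its characteristic persons, locations and organizations alongside its general terms, instead of mixing them into one bag of words.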

More info: http://www.cse.ust.hk/~jvosecky/

Published in: Technology, Education

Dynamic Multi-Faceted Topic Discovery in Twitter

  1. Dynamic Multi-Faceted Topic Discovery in Twitter
     Jan Vosecky, Di Jiang, Kenneth Wai-Ting Leung, Wilfred Ng
  2. Twitter
  3. Representation
     • Vector space model
       – Term vector: sparseness issue
     • Topic models
       – Latent topic vector: better than VSM?
  4. Topic Models
     A latent topic in LDA: “Arab revolutions”
     Terms: Libya, Force, Human, Abuse, Protect, Secure, War, Execute
     Probabilities (in slide order): 0.00040, 0.00020, 0.00010, 0.00010, 0.00009, 0.00008, 0.00005, 0.00004
  5. A topic in Twitter?
     • Not just words
     • People talk about entities: Organizations, Persons, Locations, …
     • Time
  6. Multi-faceted Topic Model
     • Each topic consists of n facets
       – Elements of each facet ~ multinomial distribution
     • Each document d is a distribution over topics
       – General terms, named entities and timestamp drawn from the respective facet of topic z
  7. Multi-faceted Topic Model
     Multi-faceted latent topic “Arab revolutions”
     Facets: General terms, Persons, Locations, Organizations, Time
  8. Parameter Inference
     • Scalability
       – Gibbs sampling and variational inference process data in a batch
     • Online inference
       – Stochastic variational inference to process streaming data
       – Model continuously updated
       – Constant time to process a new doc
  9. Perplexity comparison: Online inference vs. Gibbs sampling
     [Figure: two panels, K = 50 and K = 200]
  10. Tweet Clustering
     [Figure: vector space model (TF-IDF) compared across K-means, DBSCAN and Direct; panels (a) Manually-labeled dataset and (b) Hashtag-labeled dataset]
  11. Summary
     • Model multi-faceted topics in microblogs
       – Entity-oriented and dynamic
     • Online inference method
     • Beneficial for downstream applications
  12. Thank You!
     Jan Vosecky, Di Jiang, Kenneth Wai-Ting Leung, Wilfred Ng
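
Slide 9 compares inference methods by perplexity, but the slides do not spell out how it is computed. As a rough illustration, the sketch below uses the standard topic-model definition, the exponential of the negative average per-token log-likelihood, applied to a single facet of word ids. The function name `perplexity` and the array layout (`theta` as documents by topics, `phi` as topics by vocabulary) are assumptions, and the authors' evaluation may differ, for example by covering all facets.

```python
import numpy as np


def perplexity(docs, theta, phi):
    """Perplexity of a topic model on a set of documents.

    docs:  list of documents, each a list of integer word ids
    theta: array of shape (D, K), per-document topic distributions
    phi:   array of shape (K, V), per-topic word distributions
    """
    log_lik = 0.0
    n_tokens = 0
    for d, words in enumerate(docs):
        # p(w | d) = sum_k theta[d, k] * phi[k, w], for every token in the doc
        word_probs = theta[d] @ phi[:, words]
        log_lik += np.log(word_probs).sum()
        n_tokens += len(words)
    return float(np.exp(-log_lik / n_tokens))


# Sanity check: a model that is uniform over V = 4 words has perplexity 4.
docs = [[0, 1, 2], [3, 3]]
theta = np.array([[0.5, 0.5], [0.9, 0.1]])
phi = np.full((2, 4), 0.25)
uniform_ppl = perplexity(docs, theta, phi)
```

Lower values mean the model assigns higher probability to the evaluated tweets, which is the sense in which the two inference methods are compared at K = 50 and K = 200.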
