Computational Framework for Generating Visual Summaries of Topical Clusters in Twitter Streams

Based on: http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9

Published in: Software

  1. Semantic Modeling • Computational Framework for Generating Visual Summaries of Topical Clusters in Twitter Streams* • Authors: Miray Kas, Bongwon Suh • Presenter: Sebastian Alfers - HTW Berlin • * http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9
  2. Visual Summaries of Twitter Streams • http://flowingdata.com/wp-content/uploads/2010/02/treemap-revised1.gif • http://www.infobarrel.com/media/image/54054.jpg
  3. Step 1: get & pre-process data • Step 2: construct graph & clustering • Step 3: extract keywords & summarize • Pipeline: Keywords → Stream Tweets → Preprocessing/Cleaning → Construct Graph → Clustering → Select Relevant Clusters → Extract Topical Keywords → Visual Cluster Summary
  4. Input: Keywords • initial set of keywords • similar to Twitter Search
  6. Step 1: Stream Tweets • HTTP-based API - JSON, REST
  7. OAuth + HTTP • here: Java library with Scala and Play Framework
  8. Step 1: Preprocessing • transform tweets - easy-to-analyze / clean format • Process of cleaning: 1. lowercase 2. remove URLs, user mentions and stop words • like @user, „a“ or „123“ 3. remove special characters (#, .)
  9. Step 1: Preprocessing • Example keywords: - SCALA - Scala - scala - #scala (all reduced to „scala“) • LingPipe library* - remove tense and plurals • * http://alias-i.com/lingpipe/
  10. Step 1: Preprocessing • Example tweets: new york time reactive programming tool scala scale techrepublic akka-http based reactive stream scala scaladay
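The cleaning steps listed on the preprocessing slides can be sketched roughly as follows. This is a minimal Python illustration, not the presenter's Scala/LingPipe code: the stop-word list, the regular expressions and the name `clean_tweet` are assumptions made for the example, and the stemming step (removing tense and plurals) is left out.

```python
import re

# Assumed, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "with", "is"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Keep letters, digits, whitespace and hyphens (so "akka-http" survives); drop the rest.
SPECIAL_RE = re.compile(r"[^a-z0-9\s-]")


def clean_tweet(text):
    """Return the cleaned tokens of one tweet (hypothetical helper)."""
    text = text.lower()                 # 1. lowercase
    text = URL_RE.sub(" ", text)        # 2. remove URLs ...
    text = MENTION_RE.sub(" ", text)    #    ... and user mentions
    text = SPECIAL_RE.sub(" ", text)    # 3. remove special characters such as # and .
    tokens = text.split()
    # Step 2, continued: drop stop words and purely numeric tokens such as "123".
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]


print(clean_tweet("SCALA Scala scala #scala"))  # ['scala', 'scala', 'scala', 'scala']
```

URLs and mentions are removed before the special characters so that the `@` and `/` markers are still present when those patterns run.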
  12. Step 2: Graph • Word Co-Occurrence Graph - Word = Node (unigrams) - Tweet = Link between nodes • Example: akka-http based stream reactive scala scaladay
  14. Step 2: Graph • Word Co-Occurrence Graph - Word = Node (unigrams) - Tweet = Link between nodes • Example (figure, nodes: based, akka-http, reactive, stream, scaladay, scala)
  15. Step 2: Graph • Word Co-Occurrence Graph - Word = Node (unigrams) - Tweet = Link between nodes • Example (figure, nodes: based, akka-http, reactive, stream, scaladay, scala; labels: Nodes, Links)
  16. Step 2: Graph • Word Co-Occurrence Graph - Word = Node (unigrams) - Tweet = Link between nodes • Example (figure, nodes: based, akka-http, reactive, stream, scaladay, scala)
  19. Step 2: Graph • Co-Occurrence Graph - connect nodes (words) within and between tweets - add strength (weight) and cost (distance) • More frequent words - increase the strength - decrease the cost
  20. Step 2: Graph • Summary (figure, nodes shown: reactive, scala, stream, based, …, uses, programming, …)
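A word co-occurrence graph as described on the graph slides could be built along these lines. This is a sketch under stated assumptions: the slides show costs of 0.5 and 1 but not the formula, so `cost = 1 / weight` is an assumption chosen because it reproduces those values; the function names are hypothetical, and the split of the example text into two tweets is also assumed.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(tweets):
    """Map each unordered word pair to the number of tweets that contain both words."""
    weights = Counter()
    for tokens in tweets:
        # Every distinct word is a node; each pair of words in one tweet gets a link.
        for a, b in combinations(sorted(set(tokens)), 2):
            weights[(a, b)] += 1
    return weights


def edge_costs(weights):
    """Assumed cost function: the stronger the link, the shorter the distance."""
    return {pair: 1.0 / w for pair, w in weights.items()}


# Example tweets from the slides; the boundary between the two tweets is assumed.
tweets = [
    ["new", "york", "time", "reactive", "programming", "tool", "scala", "scale", "techrepublic"],
    ["akka-http", "based", "reactive", "stream", "scala", "scaladay"],
]
weights = build_cooccurrence_graph(tweets)
costs = edge_costs(weights)
print(weights[("reactive", "scala")], costs[("reactive", "scala")])  # 2 0.5
```

Pairs are stored in sorted order, so each link has exactly one key regardless of the word order inside a tweet.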
  21. Step 2: Clustering • Here: „complete link (max) clustering“ algorithm* - hierarchical clustering algorithm that forms clusters by merging subgroups • Group words from tweets - frequently appear on a topic - cluster = topic • * http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html
  22. Step 2: Clustering • Here: „complete link (max) clustering“ algorithm • each node starts as an individual cluster - clusters = nodes = words in tweet • close clusters are successively merged together - closeness is measured by the highest cost (maximum distance) between the words of two clusters
  23. Step 2: Clustering • Graph representation vs. cluster representation (figure, nodes: reactive, scala, stream, based, …) • cost = distance = 0.5 • cost = distance = 1
  24. Step 2: Clustering
  25. Step 2: Clustering • distance = 0.5
  26. Step 2: Clustering • distance = 1 • distance = 0.5 • distance = 1
  27. Step 2: Clustering • distance = 1 • distance = 0.5 • distance = 1 (edge labels: 1, 1)
  28. Step 2: Clustering • distance = 1 • distance = 0.5 • distance = 1 • distance = 2 (edge labels: 1, 1)
  29. Step 2: Clustering
  30. Step 2: Clustering • Final step: Dendrogram - tree diagram - represents the arrangement of hierarchical clusters • why? - easy to apply threshold metrics
  31. Step 2: Clustering • Final step: Dendrogram - closer to the root = lower similarity (figure: root; first cluster: reactive, scala)
  32. Step 2: Clustering • Final step: Dendrogram - closer to the root = lower similarity (figure: root; new, york, programming, …, akka-http, based, stream, scaladay; reactive, scala)
  33. Step 2: Clustering • Final step: Dendrogram - closer to the root = lower similarity (figure: root; new, york, programming, …, akka-http, based, stream, scaladay; reactive, scala; thresholds)
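The complete-link merging and the threshold cut on the dendrogram can be illustrated with a naive pure-Python sketch. The distance values below are hypothetical and only echo the 0.5, 1 and 2 shown on the slides; treating missing word pairs as infinitely far apart is an assumption, and the function name `complete_link` is made up for this example.

```python
import math


def complete_link(words, dist, threshold):
    """Naive agglomerative complete-link clustering, stopped at a distance threshold.

    dist maps frozenset({a, b}) to a distance; missing pairs count as infinite.
    Returns the final clusters and the list of merges (the dendrogram, bottom-up).
    """
    clusters = [frozenset([w]) for w in words]  # each word starts as its own cluster

    def linkage(c1, c2):
        # Complete link: the distance of the farthest pair across the two clusters.
        return max(dist.get(frozenset((a, b)), math.inf) for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        pairs = [(linkage(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        d, i, j = min(pairs)          # closest pair of clusters
        if d > threshold:             # cutting the dendrogram at the threshold
            break
        merged = clusters[i] | clusters[j]
        merges.append((d, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


# Hypothetical distances for four words from the slides.
raw = {
    ("reactive", "scala"): 0.5,
    ("reactive", "stream"): 1.0,
    ("scala", "stream"): 1.0,
    ("stream", "based"): 1.0,
    ("reactive", "based"): 2.0,
    ("scala", "based"): 2.0,
}
dist = {frozenset(pair): cost for pair, cost in raw.items()}
words = ["reactive", "scala", "stream", "based"]

clusters, merges = complete_link(words, dist, threshold=1.0)
print(merges[0])  # reactive and scala merge first, at distance 0.5
```

With a threshold of 1.0 the run stops with two clusters; raising it to 2.0 lets the last merge happen, which corresponds to moving the cut closer to the root of the dendrogram.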
  35. Step 3: Extract topical keywords • Preprocessing/Cleaning → Construct Graph → Extract Topical Keywords
  36. Step 3: Extract topical keywords • keywords - express a topic - are frequently used - summarize tweet content • Questions - „What are the relevant keywords?“ - „In what clusters do they appear?“
  37. Step 3: Extract topical keywords • How? - „topical tweets“ vs. „general tweets“ • frequent in topical tweets - search keywords „reactive scala“ • not frequent in general tweets - general Twitter stream (all tweets)
  38. Step 3: Extract topical keywords • Strength of a word - is a word relevant for that topical cluster? (figure: low / high frequency in topical tweets vs. low / high frequency in general tweets)
  39. Step 3: Extract topical keywords • Strength of a word - is a word relevant for that topical cluster? (figure: low / high frequency in topical tweets vs. low / high frequency in general tweets; ✔ relevant for topic / cluster)
  40. Step 3: Extract topical keywords • Result - topical strength for each keyword - sort them by relevancy - select the top 20 keywords • choose clusters that contain these words
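The idea of topical strength (frequent in topical tweets, infrequent in the general stream) can be sketched as a ratio of relative frequencies. The slides do not give the actual formula, so the ratio, the add-one smoothing and the name `topical_strength` are all assumptions for illustration; the sample tweets are made up.

```python
from collections import Counter


def topical_strength(topical_tweets, general_tweets, top_n=20):
    """Rank words by an assumed score: share of topical tweets / share of general tweets."""
    topical = Counter(w for t in topical_tweets for w in set(t))
    general = Counter(w for t in general_tweets for w in set(t))
    n_top, n_gen = len(topical_tweets), len(general_tweets)
    scores = {}
    for word, count in topical.items():
        p_topical = count / n_top
        # Add-one smoothing (assumed) so unseen words do not cause a division by zero.
        p_general = (general[word] + 1) / (n_gen + 1)
        scores[word] = p_topical / p_general
    # Sort by relevancy and keep the top keywords, as on the result slide.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up sample data: tweets matching the search keywords vs. the general stream.
topical_tweets = [["reactive", "scala", "new"], ["reactive", "scala", "stream"]]
general_tweets = [["new", "york"], ["new", "day"], ["good", "day"], ["scala", "rocks"]]
ranked = topical_strength(topical_tweets, general_tweets)
print(ranked[0][0])  # reactive
```

In this toy data "new" ends up last because it is common in the general stream, which is the behaviour the frequency comparison on the slides is after.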
  41. Final Step • Combine clusters and keywords • create visual summary
  42. Final Step • Keyword1 • Keyword2 • Keyword3 • Keyword4 • … (ordered from high relevancy to low relevancy)
  44. Final Step • Treemap visualisation - color = cluster - area of word = frequency of word
  45. Final Step • Word cloud visualisation - color = cluster - size of word = frequency of word
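The treemap rule (area of a word = its frequency, one color per cluster) can be shown with a simple slice-and-dice layout. This is only a sketch of the area rule, not the layout used in the talk: the function name, the input shape and the frequencies are hypothetical.

```python
def slice_and_dice(clusters, width=1.0, height=1.0):
    """Lay out {cluster: {word: frequency}} as rectangles (cluster, word, x, y, w, h).

    Each cluster gets a vertical strip, each word a cell inside that strip,
    so that a cell's area is proportional to the word's frequency.
    """
    total = sum(sum(words.values()) for words in clusters.values())
    rects, x = [], 0.0
    for name, words in clusters.items():
        cluster_total = sum(words.values())
        strip_w = width * cluster_total / total       # strip width ~ cluster frequency
        y = 0.0
        for word, freq in words.items():
            cell_h = height * freq / cluster_total    # cell height ~ share in cluster
            rects.append((name, word, x, y, strip_w, cell_h))
            y += cell_h
        x += strip_w
    return rects


# Hypothetical clusters and word frequencies.
example = {"cluster1": {"reactive": 3, "scala": 1}, "cluster2": {"stream": 4}}
rects = slice_and_dice(example)
areas = {word: w * h for _, word, _, _, w, h in rects}
print(areas["reactive"])  # 0.375
```

A renderer would then fill every rectangle with the color assigned to its cluster; for a word cloud, the same frequencies would drive the font size instead of the area.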
  46. Final Notes • 4. Million Topical Tweets • 15 Days • User Study - Treemap vs. Word Cloud
  47. Thank You! • Discussion - Losing precision while cleaning tweets - Losing sense while removing stop words like „not“ (negation) - Unigram vs. multigram? - ?
