Making Sense of Millions of Thoughts: Finding Patterns in the Tweets

I gave this presentation at the Workshop on Interactive Language Learning, Visualization, and Interfaces at ACL 2014 in Baltimore, MD on June 27, 2014.
http://nlp.stanford.edu/events/illvi2014/index.html

ABSTRACT
Every day on Twitter, millions of thoughts are captured and shared with the world in the form of 140-character messages, or Tweets. There are many things we could learn from these thoughts if we could figure out a way to digest this gigantic dataset. Visualization is one of the many ways to extract information from these Tweets. In this presentation, I will talk about several visualizations based on Tweets, as well as share experiences and challenges from working with Tweet data.

  1. 1. Making Sense of Millions of Thoughts: Finding patterns in the Tweets. “Knowing comes from learning, finding from seeking.” (50 characters) “What we call chaos is just patterns we haven't recognized.” (58 characters) “I am looking for a needle in the haystack.” (42 characters) “140-character text messages, called Tweets” (42 characters) Krist Wongsuphasawat
  2. 2. X-Men
  3. 3. Prof. X Ability: Telepathy (mind reading)
  4. 4. Cerebro Enhance telepathy Prof. X
  5. 5. Cerebro
  6. 6. With this power…
  7. 7. What are you thinking?
  8. 8. What are people thinking about x? Product Event Person etc.
  9. 9. Reality
  10. 10. Cerebro
  11. 11. Internet
  12. 12. Platform thought thought thought thought thought crowdsourcing social networks Data
  13. 13. Twitter tweet tweet tweet tweet tweet Tweets
  14. 14. Tweets • 140 characters • text + media • geo • time
  15. 15. Twitter tweet tweet tweet tweet tweet Tweets
  16. 16. What can we learn from these Tweets?
  17. 17. visual-insights@twitter @miguelrios @philogb @trebor @kristw
  18. 18. World Cup Election Oscars Pure Curiosity Grammy TV Shows New Year Breaking news Earthquake
  19. 19. Insights, Stories (Tweets) DATA with limited time Audience: general public
  20. 20. Tools • Hadoop • Apache Pig • Vertica • node.js, python • d3 & co.
  21. 21. Pig
  22. 22. Insights, Stories (Tweets) DATA
  23. 23. Insights, Stories (Tweets) Filter DATA
  24. 24. Having all Tweets How people think I feel.
  25. 25. Having all Tweets How people think I feel. How I really feel.
  26. 26. Filter data Good news: Bad news: Want only relevant Tweets Have all Tweets Too many Tweets
  27. 27. Filter data (2) • #hashtags, e.g. #world-cup • easy to filter • hashtags must be present • typo?
  28. 28. Filter data (2) • #hashtags, e.g. #world-cup • easy to filter • hashtags must be present • keywords, e.g. goal • broader • can be ambiguous
  29. 29. Filter data (3) • Combine with other attributes • Time • during the first half of World Cup final
  30. 30. Filter data (3) • Combine with other attributes • Time • during the first half of World Cup final • Location • Tweets from Brazil • Not every Tweet is geotagged.
  31. 31. Filter data (4) • Languages • Sometimes use only English Tweets • Future • Translation?
  32. 32. Insights, Stories (Tweets) Filter Clean DATA
  33. 33. Clean data • Typo (Mobile input) • Abbreviation (due to 140-character limit) • Exaggeration (e.g. GOOOOALLLL) • Twitter specific e.g., Old-style retweet “RT …” • Inappropriate content
  34. 34. Insights, Stories (Tweets) Filter Clean Visualize DATA
  35. 35. (+ media) photos, videos What? Where? When? GEO TIME TEXT DATA
  36. 36. What? Where? When? GEO TIME TEXT Visualize Data
  37. 37. What? Where? When? GEO TIME TEXT Visualize Data
  38. 38. TIME Tweets/second
  39. 39. TIME Tweets/second
  40. 40. TIME Tweets/second + Annotation http://www.flickr.com/photos/twitteroffice/5681263084/
  41. 41. TIME Tweets/second + Annotation Manual To automate Top tweets (most Retweets, Favs)
  42. 42. What? Where? When? GEO TIME TEXT Visualize Data
  43. 43. GEO Heatmap Low density High density
  44. 44. GEO New York City flickr.com/photos/twitteroffice/8798020541
  45. 45. GEO San Francisco flickr.com/photos/twitteroffice/8798020541
  46. 46. GEO San Francisco Rebuild the world based on tweet volumes twitter.github.io/interactive/andes/
  47. 47. What? Where? When? GEO TIME TEXT Visualize Data
  48. 48. TIME + GEO blog.twitter.com/2011/global-pulse youtu.be/SybWjN9pKQk Japan Earthquake 2011
  49. 49. TIME + GEO Tweet pattern [Rios & Lin 2012] Night Late night Daytime Night Late night Daytime
  50. 50. What? Where? When? GEO TIME TEXT Visualize Data
  51. 51. TEXT Trends
  52. 52. TEXT www.wordle.net Some samples from World Cup
  53. 53. TEXT Word cloud of Tweets right after the 1st goal www.wordle.net
  54. 54. TEXT WordTree [Wattenberg & Viégas 2008] www.jasondavies.com/wordtree
  55. 55. TEXT • Now • Derived information: Sentiment, Topic • Combine with other information (geo & time) + context • Future • Better technique + involves more NLP e.g. key phrases, etc.
  56. 56. TEXT Descriptive Keyphrases [Chuang et al. 2012]
  57. 57. TEXT • Challenge • Scale
  58. 58. What? Where? When? GEO TIME TEXT Visualize Data
  59. 59. GEO + TEXT Real-time Tweet map
  60. 60. GEO + TEXT Real-time Tweet map
  61. 61. GEO + TEXT Real-time Tweet map most frequent term
  62. 62. GEO + TEXT Real-time Tweet map Gmail went down Jan 24, 2014
  63. 63. GEO + TEXT Real-time Tweet map Nelson Mandela passed away Dec 5, 2013
  64. 64. GEO + TEXT Real-time Tweet map • Next: • Involves more NLP • Tokenization - Languages without space between words • etc. • Challenge: • Real-time
  65. 65. GEO + TEXT www.yelp.com/wordmap Yelp Wordmap
  66. 66. What? Where? When? GEO TIME TEXT Visualize Data
  67. 67. TIME + TEXT http://www.babynamewizard.com/voyager Baby Name Voyager
  68. 68. TIME + TEXT http://www.babynamewizard.com/voyager Baby Name Voyager
  69. 69. TIME + TEXT UEFA Champions League Biggest Tournament for European soccer clubs Many Tweets during the matches
  70. 70. TIME + TEXT UEFA Champions League Dortmund Bayern Munich Count Tweets mentioning the teams every minute Team 1 Team 2
  71. 71. TIME + TEXT UEFA Champions League
  72. 72. TIME + TEXT UEFA Champions League + “goal” count + context
  73. 73. TIME + TEXT UEFA Champions League + “offside”
  74. 74. TIME + TEXT UEFA Champions League + players
  75. 75. A B C D A C C Competition Tree vs vs vs
  76. 76. A B C D A C C Competition Tree + vs vs vs
  77. 77. A B C D A C C Competition Tree + = uclfinal.twitter.com vs vs vs
  78. 78. TIME + TEXT UEFA Champions League • Challenges • Filter relevant Tweets • Multiple matches at the same time • Ambiguous words: “goal”, “red”, “yellow” • Tweets mentioning both teams e.g. “#GER 2-2 #GHA”
  79. 79. What? Where? When? GEO TIME TEXT Visualize Data
  80. 80. TIME + GEO + TEXT State of the Union twitter.github.io/interactive/sotu2014
  81. 81. TIME + GEO + TEXT State of the Union 1) timeline + topic from Tweets 4) Density map of Tweets about selected topic 3) Volume of Tweets by topics during selected part of the SOTU 2) context (speech) twitter.github.io/interactive/sotu2014
  82. 82. TIME + GEO + TEXT New Year 2014
  83. 83. TIME + GEO + TEXT New Year 2014
  84. 84. TIME + GEO + TEXT New Year 2014 twitter.github.io/interactive/newyear2014/
  85. 85. Recap
  86. 86. What can we learn from these Tweets? many, many things.
  87. 87. better the examples in this talk imagine… DATA (Tweets)
  88. 88. Insights, Stories (Tweets) Filter Clean Visualize DATA
  89. 89. (Tweets) Insights, Stories Filter Clean Process & Visualize DATA
  90. 90. (Tweets) Insights, Stories Filter Clean Process & Visualize DATA NLP
  91. 91. TEXT What? Where? When? GEO TIME Visualize data
  92. 92. (Tweets) Insights, Stories Filter Clean Process & Visualize DATA Research
  93. 93. Working together Raw data Human
  94. 94. Working together Raw data Human Computer (One machine, Cloud, MapReduce, etc.)
  95. 95. Working together Raw data Human Ignored information Processed information Computer (One machine, Cloud, MapReduce, etc.)
  96. 96. Working together Raw data Human Aggregated information Ignored information Processed information Computer (One machine, Cloud, MapReduce, etc.)
  97. 97. Working together Raw data Human Aggregated information Ignored information Processed information Computer (One machine, Cloud, MapReduce, etc.) NLP Make computers think more like humans.
  98. 98. Working together Raw data Human Aggregated information Ignored information Processed information VIS Help people consume information. Computer (One machine, Cloud, MapReduce, etc.) NLP Make computers think more like humans.
  99. 99. Working together Raw data Human Aggregated information Ignored information Processed information VIS Help people consume information. Computer (One machine, Cloud, MapReduce, etc.) NLP Make computers think more like humans. HCI User interactions or Provide feedback Bridge the gap. Connect human & computer.
  100. 100. Advanced techniques vs. Scalability
  101. 101. LifeFlow => Flying Sessions Research System at Twitter
  102. 102. Summary • Thoughts are captured in the Tweets: what, where, when • Finding patterns from: text + geo + time • Opportunities for NLP + HCI + VIS collaboration • Better technique vs. Scalability + Real-time @kristw / interactive.twitter.com
  103. 103. Questions?
  104. 104. Thank you
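
The Filter, Clean and "count Tweets mentioning the teams every minute" steps from the slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used at Twitter (the slides list Hadoop, Pig and Vertica for that). The sample records, the `#uclfinal` hashtag, and the function names `matches`, `clean` and `mentions_per_minute` are all made up for this sketch.

```python
import re
from collections import Counter
from datetime import datetime

# Made-up sample records for illustration only; real Tweets carry many more fields.
TWEETS = [
    {"text": "GOOOOALLLL!!! #UCLfinal Bayern", "time": "2014-01-01T20:01:10"},
    {"text": "RT @fan: Dortmund equalise! #UCLfinal", "time": "2014-01-01T20:01:45"},
    {"text": "Lunch was great today", "time": "2014-01-01T20:01:50"},
    {"text": "Bayern again, what a goal #uclfinal", "time": "2014-01-01T20:02:05"},
]


def matches(text, hashtags=(), keywords=()):
    """Filter step: keep a Tweet that contains any given hashtag or keyword.

    Hashtags and keywords are expected in lowercase, without the leading '#'.
    """
    lowered = text.lower()
    tags = set(re.findall(r"#(\w+)", lowered))
    words = set(re.findall(r"\w+", lowered))
    return bool(tags & set(hashtags)) or bool(words & set(keywords))


def clean(text):
    """Clean step: strip an old-style retweet prefix and squeeze exaggeration."""
    text = re.sub(r"^RT\s+@\w+:?\s*", "", text)
    # Crude heuristic: collapse runs of 3+ identical characters to a single one.
    # It also damages words with a legitimate double letter that was stretched.
    return re.sub(r"(.)\1{2,}", r"\1", text)


def mentions_per_minute(tweets, teams):
    """Count Tweets mentioning each team, bucketed by minute (HH:MM)."""
    counts = Counter()
    for tweet in tweets:
        minute = datetime.fromisoformat(tweet["time"]).strftime("%H:%M")
        lowered = tweet["text"].lower()
        for team in teams:
            if team.lower() in lowered:
                counts[(minute, team)] += 1
    return counts


# Filter first, then clean, then aggregate over time.
relevant = [
    dict(t, text=clean(t["text"]))
    for t in TWEETS
    if matches(t["text"], hashtags={"uclfinal"})
]
counts = mentions_per_minute(relevant, ["Dortmund", "Bayern"])

for (minute, team), n in sorted(counts.items()):
    print(minute, team, n)
```

Substring matching on team names is the simplest possible choice and runs straight into the challenges listed on the slides: ambiguous words, multiple matches at the same time, and Tweets that mention both teams are all counted naively here.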
