
Adventure in Data: A tour of visualization projects at Twitter

Guest lecture at Prof. David Gotz's UNC Chapel Hill INLS 690 Visual Analytics class (given remotely) on Nov 10, 2015.

Many demos can also be accessed from interactive.twitter.com and kristw.yellowpigz.com

Published in: Data & Analytics

  1. 1. Krist Wongsuphasawat / @kristw Adventure in data A tour of visualizations at Twitter
  2. 2. Krist Wongsuphasawat / @kristw
  3. 3. Computer Engineer Bangkok, Thailand PhD in Computer Science Univ. of Maryland Information Visualization IBM Microsoft Data Visualization Scientist Twitter Krist Wongsuphasawat / @kristw
  4. 4. Krist Wongsuphasawat / @kristw Adventure in data A whirlwind tour of visualization projects at Twitter
  5. 5. Get data
  6. 6. Having all Tweets How people think I feel.
  7. 7. How people think I feel. How I really feel. Having all Tweets
  8. 8. • Too much data. Want only relevant Tweets • hashtag: #BRA • keywords: “goal” • Need to aggregate & reduce size • Long processing time (hours) Challenges
  9. 9. Hadoop Cluster Data Storage Workflow
  10. 10. Hadoop Cluster Pig / Scalding (slow) Data Storage Tool Workflow
  11. 11. Hadoop Cluster Pig / Scalding (slow) Data Storage Tool Workflow
  12. 12. Hadoop Cluster Pig / Scalding (slow) Data Storage Tool Smaller dataset Your laptop Workflow
  13. 13. Hadoop Cluster Pig / Scalding (slow) Data Storage Tool Final dataset Tool node.js / python / excel (fast) Your laptop Workflow Smaller dataset
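
The laptop-side step of this workflow (keep only relevant Tweets, then aggregate to shrink the data) can be sketched in Python. This is a minimal illustration and not the pipeline used at Twitter: the record layout (dicts with `text` and `created_at`), the helper names, and the sample Tweets are all assumptions; the hashtag and keyword are the examples from the Challenges slide.

```python
from collections import Counter
from datetime import datetime

def is_relevant(text, hashtag="#BRA", keyword="goal"):
    """Keep a Tweet only if it mentions the hashtag or the keyword (case-insensitive)."""
    lowered = text.lower()
    return hashtag.lower() in lowered or keyword.lower() in lowered

def tweets_per_minute(tweets):
    """Reduce the relevant Tweets to one count per minute."""
    counts = Counter()
    for tweet in tweets:
        if is_relevant(tweet["text"]):
            minute = tweet["created_at"].replace(second=0, microsecond=0)
            counts[minute] += 1
    return counts

# Made-up sample records, for illustration only.
sample = [
    {"text": "GOAL!!! #BRA", "created_at": datetime(2014, 6, 12, 20, 11, 5)},
    {"text": "what a goal", "created_at": datetime(2014, 6, 12, 20, 11, 40)},
    {"text": "lunch time", "created_at": datetime(2014, 6, 12, 20, 12, 1)},
]
per_minute = tweets_per_minute(sample)
```

The output is tiny compared with the input, which is the point of the "smaller dataset" and "final dataset" steps: the slow cluster job runs once, and the fast local tools iterate on the reduced data.
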
  14. 14. Krist Wongsuphasawat / @kristw Cleaning data a story of my life
  15. 15. Storytelling Analytics Tools Creative Projects
  16. 16. To understand the world and share the stories To understand Twitter users and improve the service To showcase the data and inspire Projects Storytelling Analytics Tools Creative
  17. 17. Storytelling 1 World Cup Election Oscars TV Shows New Year Earthquake Super Bowl Protest … Behaviors Sleeping Daylight saving Language … Events Fasting Information spread Commute
  18. 18. So many things we could learn from Twitter data
  19. 19. Give us interesting vis about xxxx by Nov 10
  20. 20. Challenge accepted
  21. 21. Tweets (+ media) photos, videos What? Where? When? GEO TIME TEXT
  22. 22. What? Where? When? GEO TIME TEXT Visualize Tweets
  23. 23. What? Where? When? GEO TIME TEXT Visualize Tweets
  24. 24. Time Tweets/second
  25. 25. Time Tweets/second
  26. 26. Time Tweets/second + Annotation http://www.flickr.com/photos/twitteroffice/5681263084/
  27. 27. What? Where? When? GEO TIME TEXT Visualize Tweets
  28. 28. Geo Heatmap Low density High density
  29. 29. Geo San Francisco flickr.com/photos/twitteroffice/8798020541 Low density High density
  30. 30. Geo San Francisco Rebuild the world based on tweet volumes twitter.github.io/interactive/andes/
  31. 31. What? Where? When? GEO TIME TEXT Visualize Tweets
  32. 32. Text www.wordle.net Some experiments during World Cup
  33. 33. Text www.wordle.net Word cloud of Tweets right after the 1st goal
  34. 34. Text Word cloud of Tweets right after the 1st goal www.wordle.net It was an “own” goal.
  35. 35. Text WordTree [Wattenberg & Viégas 2008] www.jasondavies.com/wordtree
  36. 36. Text word/phrase/hashtag count topic
  37. 37. What? Where? When? GEO TIME TEXT Visualize Tweets
  38. 38. Time + Geo Tweet pattern [Rios & Lin 2012] Night Late night Daytime Night Late night Daytime
  39. 39. Night Late night Daytime Night Late night Daytime Time + Geo Tweet pattern [Rios & Lin 2012]
  40. 40. Night Late night Daytime Night Late night Daytime Time + Geo Tweet pattern [Rios & Lin 2012]
  41. 41. Night Late night Daytime Night Late night Daytime Time + Geo Tweet pattern [Rios & Lin 2012]
  42. 42. What? Where? When? GEO TIME TEXT Visualize Tweets
  43. 43. Geo + Text Real-time Tweet map
  44. 44. Geo + Text Real-time Tweet map
  45. 45. most frequent term Geo + Text Real-time Tweet map
  46. 46. Gmail was down Jan 24, 2014 Geo + Text Real-time Tweet map
  47. 47. Nelson Mandela passed away Dec 5, 2013 Geo + Text Real-time Tweet map
  48. 48. What? Where? When? GEO TIME TEXT Visualize Tweets
  49. 49. Time + Text UEFA Champions League Biggest tournament for European soccer clubs Many Tweets during the matches
  50. 50. UEFA Champions League Dortmund Bayern Munich Count Tweets mentioning the teams every minute Team 1 Team 2 Time + Text
  51. 51. Time + Text UEFA Champions League
  52. 52. + “goal” count + context Time + Text UEFA Champions League
  53. 53. + “offside” Time + Text UEFA Champions League
  54. 54. + players Time + Text UEFA Champions League
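
The per-minute series behind these Time + Text charts can be sketched as below. It is only an assumed approach: the matching is plain substring search, the `(minute, text)` input shape is hypothetical, and the sample Tweets are made up. The same function covers the added "goal" series (and would cover "offside" or player names) by adding entries to `terms`.

```python
from collections import Counter

def count_mentions_per_minute(tweets, terms):
    """tweets: iterable of (minute, text); terms: label -> lowercase substring to look for."""
    counts = {label: Counter() for label in terms}
    for minute, text in tweets:
        lowered = text.lower()
        for label, needle in terms.items():
            if needle in lowered:
                counts[label][minute] += 1
    return counts

terms = {"Dortmund": "dortmund", "Bayern Munich": "bayern", "goal": "goal"}
# Made-up Tweets; the minute is counted from kick-off.
tweets = [
    (0, "Come on Dortmund!"),
    (0, "Bayern look sharp tonight"),
    (1, "GOAL for Bayern"),
    (1, "Dortmund need a goal now"),
]
series = count_mentions_per_minute(tweets, terms)
```
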
  55. 55. A B C D A C C Competition Tree vs vs vs + = uclfinal.twitter.com
  56. 56. What? Where? When? GEO TIME TEXT Visualize Tweets
  57. 57. Time + Text + Geo State of the Union twitter.github.io/interactive/sotu2014
  58. 58. 1) timeline + topic from Tweets 4) Density map of Tweets about selected topic 3) Volume of Tweets by topics during selected part of the SOTU 2) context (speech) twitter.github.io/interactive/sotu2014 Time + Text + Geo State of the Union
  59. 59. World Cup 2014 Time + Text interactive.twitter.com/wccompetitree
  60. 60. Time + Text + Geo World Cup 2014 interactive.twitter.com/wccompetitree
  61. 61. What? Where? When? GEO TIME TEXT Visualize Tweets
  62. 62. What? Where? When? GEO TIME TEXT Visualize Tweets + Non-Twitter data CONTEXT
  63. 63. Time + Text New Year 2014
  64. 64. Time + Text New Year 2014
  65. 65. Time + Text + Geo (c) New Year 2014 twitter.github.io/interactive/newyear2014/
  66. 66. Analytics Tools 2 Data sources Output explore analyze present get * *
  67. 67. Analytics Tools 2 Data sources Output explore analyze present get * *
  68. 68. Analytics Tools 2 Data sources Output explore analyze present get * * ad-hoc scripts
  69. 69. Analytics Tools 2 Data sources Output explore analyze present get * * ad-hoc scripts tools for exploration
  70. 70. User activity logs
  71. 71. Users Use Twitter
  72. 72. Users Use Product Managers Curious Twitter
  73. 73. Users Use Curious Engineers Log data in Hadoop Write Twitter Instrument Product Managers
  74. 74. What is being logged? tweet activities
  75. 75. What is being logged? tweet from home timeline on twitter.com tweet from search page on iPhone activities
  76. 76. What is being logged? tweet from home timeline on twitter.com tweet from search page on iPhone sign up log in retweet etc. activities
  77. 77. Organize?
  78. 78. log event a.k.a. “client event” [Lee et al. 2012]
  79. 79. log event a.k.a. “client event” client : page : section : component : element : action web : home : timeline : tweet_box : button : tweet 1) User ID 2) Timestamp 3) Event name 4) Event detail [Lee et al. 2012]
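
The six-level event name on this slide maps naturally onto a small record type. The sketch below is illustrative only: `ClientEvent` and `parse_event_name` are hypothetical names, not part of Twitter's logging library, and the parser simply assumes colon-separated fields as shown on the slide.

```python
from typing import NamedTuple

class ClientEvent(NamedTuple):
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str

def parse_event_name(name):
    """Split 'client:page:section:component:element:action' into its six fields."""
    parts = [part.strip() for part in name.split(":")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {name!r}")
    return ClientEvent(*parts)

event = parse_event_name("web : home : timeline : tweet_box : button : tweet")
```

Here `event.page` is `"home"` and `event.action` is `"tweet"`. A full log record would pair this name with the user ID, timestamp, and event detail listed on the slide.
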
  80. 80. Log data
  81. 81. Users Use Curious Engineers Log data in Hadoop Twitter Instrument Write Product Managers bigger than Tweet data
  82. 82. Users Use Curious Engineers Log data in Hadoop Data Scientists Ask Twitter Instrument Write Product Managers
  83. 83. Users Use Curious Engineers Log data in Hadoop Data Scientists Find Ask Twitter Instrument Write Product Managers
  84. 84. Log data
  85. 85. Users Use Curious Engineers Log data in Hadoop Data Scientists Find, Clean Ask Twitter Instrument Write Product Managers
  86. 86. Users Use Curious Engineers Log data in Hadoop Data Scientists Find, Clean Ask Monitor Twitter Instrument Write Product Managers
  87. 87. Users Use Curious Engineers Log data in Hadoop Data Scientists Find, Clean, Analyze Ask Monitor Twitter Instrument Write Product Managers
  88. 88. Log data Engineers Data Scientists Users in Hadoop Find, Clean, Analyze Use Monitor Ask Curious 1 2 Twitter Instrument Write Product Managers
  89. 89. Part I Find & Monitor Client Events
  90. 90. Motivation
  91. 91. Log data in Hadoop Engineers & Data Scientists billions of rows
  92. 92. Log data in Hadoop Aggregate 10,000+ event types date client page section comp. elem. action count 20141011 web home home - - impression 100 20141011 web home wtf - - click 20 Engineers & Data Scientists Client event collection
  93. 93. Log data in Hadoop Aggregate 10,000+ event types date client page section comp. elem. action count 20141011 web home home - - impression 100 20141011 web home wtf - - click 20 Engineers & Data Scientists Client event collection (Who-to-Follow)
  94. 94. Log data in Hadoop Aggregate Client event collection Engineers & Data Scientists
  95. 95. Log data in Hadoop Aggregate Find client page section component element action Search Client event collection Engineers & Data Scientists
  96. 96. Log data in Hadoop Aggregate Find client page section component element action Search Client event collection Engineers & Data Scientists
  97. 97. section? component? element?
  98. 98. client page section component element action Search Find Log data in Hadoop Aggregate web : home : * : * : * : impression Client event collection Engineers & Data Scientists
  99. 99. client page section component element action Search Find Query Return Log data in Hadoop Results web : home : home : - : - : impression web : home : wtf : - : - : impression Aggregate web : home : * : * : * : impression Client event collection Engineers & Data Scientists
  100. 100. client page section component element action Search Find Query Return Log data in Hadoop Results web : home : home : - : - : impression web : home : wtf : - : - : impression Aggregate search can be better Client event collection Engineers & Data Scientists
  101. 101. client page section component element action Search Find Query Return Log data in Hadoop Results web : home : home : - : - : impression web : home : wtf : - : - : impression Aggregate 10,000+ event types search can be better Client event collection Engineers & Data Scientists
  102. 102. client page section component element action Search Find Query Return Log data in Hadoop Results web : home : home : - : - : impression web : home : wtf : - : - : impression Aggregate search can be better 10,000+ event types not everybody knows What are all sections under web:home? Client event collection Engineers & Data Scientists
  103. 103. client page section component element action Search Find Query Return Log data in Hadoop Results web : home : home : - : - : impression Aggregate search can be better one graph / event 10,000+ event types not everybody knows What are all sections under web:home? Client event collection Engineers & Data Scientists
  104. 104. client page section component element action Search Find Query Return Log data in Hadoop Results web : home : home : - : - : impression Aggregate search can be better one graph / event x 10,000 10,000+ event types not everybody knows What are all sections under web:home? Client event collection Engineers & Data Scientists
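
The field-by-field wildcard search described on these slides can be sketched as follows. This is an assumed reading of the search form (one box per level, with `*` matching anything), not the tool's actual query engine; `matches` is a hypothetical helper and the event list is a small made-up sample built from names that appear in the deck.

```python
def matches(pattern, event_name):
    """True if every field equals the pattern field or the pattern field is '*'."""
    wanted = [field.strip() for field in pattern.split(":")]
    actual = [field.strip() for field in event_name.split(":")]
    return len(wanted) == len(actual) and all(
        w == "*" or w == a for w, a in zip(wanted, actual)
    )

events = [
    "web : home : home : - : - : impression",
    "web : home : wtf : - : - : impression",
    "web : home : wtf : - : - : click",
    "iphone : home : - : - : - : impression",
]

# Search: all impressions on the web home page.
hits = [e for e in events if matches("web : home : * : * : * : impression", e)]

# Explore: which sections exist under web:home?
sections = sorted(
    {e.split(":")[2].strip() for e in events if matches("web : home : * : * : * : *", e)}
)
```

A flat result list like `hits` answers a search, but the second query shows why a browsable hierarchy helps: with 10,000+ event types, few people know which sections exist before they search.
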
  105. 105. • Search for client events • Explore client event collection • Monitor changes Goals
  106. 106. Design
  107. 107. Client event collection Engineers & Data Scientists
  108. 108. See Client event collection Engineers & Data Scientists
  109. 109. See Interactions search box => filter Client event collection narrow down Engineers & Data Scientists
  110. 110. See How to visualize? narrow down Client event collection Engineers & Data Scientists Interactions search box => filter
  111. 111. See How to visualize? narrow down Client event collection Engineers & Data Scientists client : page : section : component : element : action Interactions search box => filter
  112. 112. Client event hierarchy iphone home - - - impression tweet tweet click iphone:home:-:-:-:impression iphone:home:-:tweet:tweet:click
  113. 113. Detect changes iphone home - - - impression tweet tweet click iphone home - - - impression tweet tweet click TODAY 7 DAYS AGO compared to
  114. 114. Calculate changes +5% +5% +5% +10% +10% +10% -5% -5% -5% DIFF
  115. 115. Display changes iphone home - - - impression tweet tweet click Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
  116. 116. Display changes home - - - impression tweet tweet click iphone
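
The detect, calculate, and display steps compare each event's count today with its count seven days ago, at every level of the hierarchy. A minimal sketch, assuming daily counts are available as plain dicts keyed by event name; the counts are made up and `percent_change` and `rollup` are hypothetical helpers, not code from Scribe Radar.

```python
def percent_change(today, week_ago):
    """Relative change in percent; None when there is no baseline to compare to."""
    if week_ago == 0:
        return None
    return (today - week_ago) * 100.0 / week_ago

def rollup(counts, depth):
    """Sum the counts of events that share the same first `depth` hierarchy levels."""
    totals = {}
    for name, count in counts.items():
        prefix = ":".join(name.split(":")[:depth])
        totals[prefix] = totals.get(prefix, 0) + count
    return totals

# Made-up daily counts for the two events shown on the hierarchy slide.
today = {"iphone:home:-:-:-:impression": 110, "iphone:home:-:tweet:tweet:click": 19}
week_ago = {"iphone:home:-:-:-:impression": 100, "iphone:home:-:tweet:tweet:click": 20}

# Change per leaf event.
diff = {
    name: percent_change(today.get(name, 0), week_ago.get(name, 0))
    for name in sorted(set(today) | set(week_ago))
}

# Change for the whole iphone:home page (first two levels).
page_diff = percent_change(
    rollup(today, 2)["iphone:home"], rollup(week_ago, 2)["iphone:home"]
)
```

Rolling up by prefix is what lets a treemap-style view colour both a whole page and its individual events by their change.
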
  117. 117. Demo Scribe Radar
  118. 118. Twitter for Banana
  119. 119. Part II Analysis
  120. 120. Count page visits banana : home : - : - : - : impression home page
  121. 121. Funnel home page profile page
  122. 122. Funnel analysis banana : home : - : - : - : impression banana : profile : - : - : - : impression 1 job home page profile page 1 hour
  123. 123. Funnel analysis banana : home : - : - : - : impression banana : profile : - : - : - : impression banana : search : - : - : - : impression home page profile page search page 2 jobs 2 hours
  124. 124. Funnel analysis banana : home : - : - : - : impression banana : profile : - : - : - : impression banana : search : - : - : - : impression home page profile page search page Specify all funnels manually! n jobs n hours
  125. 125. Goal banana : home : - : - : - : impression … …… 1 job => all funnels, visualized home page
  126. 126. • Visualize an overview of event sequences • Big data? eBay checkout sequences Related work [Wongsuphasawat et al. 2011, Monroe et al. 2013, …] [Shen et al. 2013]
  127. 127. User sessions Session#1 A B start end Session#4 start end A Session#2 B start end A Session#3 C start end A
  128. 128. Aggregate 4 sessions A BB C start end endend A A end A
  129. 129. Aggregate A BB C start end endend end 4 sessions
  130. 130. Aggregate C start end endend end A B 4 sessions
  131. 131. Aggregate C start end endend end A B 4 sessions
  132. 132. Aggregate C start end endend A B end 4 sessions
  133. 133. Aggregate C start endend A B end 4 sessions
  134. 134. Aggregate C start endend A B end 4 sessions
  135. 135. Aggregate start endend A CB end 4 sessions
  136. 136. Aggregate 4,000,000 sessions endend A CB end start
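
The aggregation shown in these build slides merges sessions that share a prefix into one tree, with each node recording how many sessions pass through it. Below is a small single-machine sketch using the four example sessions from the User sessions slide; the dict-based node layout and the explicit `"end"` marker are assumptions made for illustration.

```python
def build_tree(sessions):
    """Merge sessions into a prefix tree; each node counts the sessions passing through it."""
    root = {"count": 0, "children": {}}
    for session in sessions:
        root["count"] += 1
        node = root
        for event in list(session) + ["end"]:
            node = node["children"].setdefault(event, {"count": 0, "children": {}})
            node["count"] += 1
    return root

# Sessions #1 to #4 from the slide: A-B, A-B, A-C, A.
sessions = [["A", "B"], ["A", "B"], ["A", "C"], ["A"]]
tree = build_tree(sessions)
after_a = tree["children"]["A"]
```

Every funnel is then a path in the tree: the count at `A` then `B` is the number of sessions that went from A to B, so one pass over the data answers all funnel questions at once instead of one job per funnel.
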
  137. 137. Twitter for Banana
  138. 138. try with sample data (millions of sessions, 10,000+ event types) vs. original paper (100,000 sessions, ~10 event types)
  139. 139. fail…
  140. 140. How to make it work?
  141. 141. # of unique sequences
  142. 142. 1. Reduce event types Reduce # of unique sequences
  143. 143. 1. Reduce event types Reduce # of unique sequences 10,000 types select tweet sign up log out
  144. 144. 1. Reduce event types Reduce # of unique sequences 10,000 types select tweet sign up log out
  145. 145. 1. Reduce event types Reduce # of unique sequences 10,000 types select merge tweet from home timeline tweet from search page tweet … = tweet
  146. 146. 1. Reduce event types 2. Reduce sequence length Reduce # of unique sequences
  147. 147. 1. Reduce event types 2. Reduce sequence length Reduce # of unique sequences session 1000 events
  148. 148. 1. Reduce event types 2. Reduce sequence length Reduce # of unique sequences session 10 events after (window size & direction) 1000 events visit home page (alignment)
  149. 149. 1. Reduce event types 2. Reduce sequence length (1 and 2: ask users for input) Reduce # of unique sequences
  150. 150. 1. Reduce event types 2. Reduce sequence length (1 and 2: ask users for input) 3. More aggregation on Hadoop Reduce # of unique sequences
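
The first two reductions, which depend on user input, can be sketched as two small functions. The names are hypothetical and the event strings are made up for illustration. One assumption to note: the window here includes the alignment event itself, which the slides do not specify.

```python
def reduce_event_types(events, merge_map):
    """Select and merge: keep only events listed in merge_map, renamed to their merged type."""
    return [merge_map[event] for event in events if event in merge_map]

def align_and_window(events, align_on, window_size, direction="after"):
    """Return the alignment event plus up to `window_size` events after (or before) it.

    Returns None when the session never contains the alignment event.
    """
    if align_on not in events:
        return None
    i = events.index(align_on)  # first occurrence
    if direction == "after":
        return events[i : i + 1 + window_size]
    if direction == "before":
        return events[max(0, i - window_size) : i + 1]
    raise ValueError("direction must be 'after' or 'before'")

merge_map = {
    "visit home page": "visit home page",
    "tweet from home timeline": "tweet",
    "tweet from search page": "tweet",
    "log out": "log out",
}
raw = [
    "log in",
    "visit home page",
    "tweet from home timeline",
    "scroll",
    "tweet from search page",
    "log out",
]
reduced = reduce_event_types(raw, merge_map)
window = align_and_window(reduced, "visit home page", 2)
```

Both steps shrink the space of possible sequences before any counting happens: fewer distinct event types, and a short fixed-length window instead of a session with 1000 events.
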
  151. 151. Collapse events Sequence ABBBCCCC ABBCC ABC ABCCCC ABCD ABCCCD ABCCE ABCDF ABCDG ABCDH e.g. tweet, tweet, tweet, … = tweet
  152. 152. Sequence ABC ABC ABC ABC ABCD ABCD ABCE ABCDF ABCDG ABCDH Collapse events
  153. 153. Group & Count Sequence ABC ABCD ABCE ABCDF ABCDG ABCDH … Count 2000 80 20 1 1 1 …
  154. 154. Group & Count Sequence ABC ABCD ABCE ABCDF ABCDG ABCDH ABCDI ABCDJK ABCDJL Count 2000 80 20 1 1 1 1 1 1 rare sequences (count < threshold)
  155. 155. Truncate Sequence ABC ABCD ABCE ABCDx ABCDx ABCDx ABCDx ABCDJx ABCDJx Count 2000 80 20 1 1 1 1 1 1 Replace last event with x (…)
  156. 156. Sequence ABC ABCD ABCE ABCDx ABCDJx Count 2000 80 20 4 2 Group & Count
  157. 157. Truncate more Sequence ABC ABCD ABCE ABCDx ABCDx Count 2000 80 20 4 2
  158. 158. Group & Count Sequence ABC ABCD ABCE ABCDx Count 2000 80 20 6
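
The collapse, group and count, and truncate steps above can be sketched on a single machine as follows. The talk runs this aggregation as a Hadoop job, so this is only an illustration of the logic. The threshold of 3 is an assumption chosen to reproduce the slide's numbers, and the loop stops truncating once a single real event remains, a safeguard not stated in the slides.

```python
from collections import Counter
from itertools import groupby

RARE = "x"  # placeholder standing for "some rare continuation"

def collapse(seq):
    """Merge consecutive repeats: A B B B C C -> A B C."""
    return tuple(event for event, _ in groupby(seq))

def real_events(seq):
    """The sequence without a trailing placeholder."""
    return seq[:-1] if seq and seq[-1] == RARE else seq

def truncate(seq):
    """Replace the last real event with the placeholder: A B C D F -> A B C D x."""
    return real_events(seq)[:-1] + (RARE,)

def aggregate(sequences, threshold):
    """Collapse, count, then repeatedly truncate sequences rarer than `threshold`."""
    counts = Counter(collapse(seq) for seq in sequences)
    while True:
        merged = Counter()
        changed = False
        for seq, count in counts.items():
            # Assumption: always keep at least one real event.
            if count < threshold and len(real_events(seq)) > 1:
                seq = truncate(seq)
                changed = True
            merged[seq] += count
        counts = merged
        if not changed:
            return counts

# Counts taken from the Group & Count slide (already collapsed).
observed = (
    ["ABC"] * 2000 + ["ABCD"] * 80 + ["ABCE"] * 20
    + ["ABCDF", "ABCDG", "ABCDH", "ABCDI", "ABCDJK", "ABCDJL"]
)
patterns = aggregate(observed, threshold=3)
```

With these inputs the six rare sequences end up merged into a single `A B C D x` pattern with count 6, matching the last Group & Count slide, while the three frequent patterns are left untouched.
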
  159. 159. 1. Define set of events 2. Pick alignment, direction and window size 3. Run Hadoop job (with more aggregation) 4. Wait for it… (2+ hrs) 5. Visualize Final process ~100,000 patterns (10MB) gazillion patterns (TBs)
  160. 160. Demo Flying Sessions
  161. 161. • Large-scale User Activity Logs + Visual Analytics • Used in day-to-day operations at Twitter • Generalize to smaller systems Summary Challenge big data small data visualize & interact aggregate & sacrifice
  162. 162. Data sources Output Creative 3 …
  163. 163. https://medium.com/@kristw/designing-the-game-of-tweets-7f87c30dc5a2 Demo / Game of Tweets
  164. 164. To understand the world and share the stories To understand Twitter users and improve the service To showcase the data and inspire Projects Storytelling Analytics Tools Creative
  165. 165. Oh no…. NOT AGAIN
  166. 166. To understand the world and share the stories To understand Twitter users and improve the service To showcase the data and inspire Projects Storytelling Analytics Tools Creative Reusable Toolkits To implement once and for all
  167. 167. Coming soon Demo / Labella.js
  168. 168. https://github.com/twitter/d3kit Demo / d3Kit http://www.slideshare.net/kristw/d3kit
  169. 169. Conclusions Data are everywhere. Many applications: Journalism, Product development, Art, etc. Combine visualization with other skills: HCI, Design, Stats, ML, etc. Don’t repeat yourself. Krist Wongsuphasawat / @kristw interactive.twitter.com kristw.yellowpigz.com
  170. 170. @philogb @trebor @miguelrios @smrogers @lintool @linuslee @chuangl4 and many other colleagues at @twitter Acknowledgement
  171. 171. Conclusions Data are everywhere. Many applications: Journalism, Product development, Art, etc. Combine visualization with other skills: HCI, Design, Stats, ML, etc. Don’t repeat yourself. Krist Wongsuphasawat / @kristw interactive.twitter.com kristw.yellowpigz.com
  172. 172. Thank you
  173. 173. Questions?
