
What to expect when you are visualizing (v.2)



Guest lecture at UNC Chapel Hill Nov 8, 2016

Published in: Data & Analytics


  1. 1. WHAT TO EXPECT WHEN YOU ARE VISUALIZING Krist Wongsuphasawat / @kristw Based on true stories Forever querying Never-ending cleaning Hopelessly prototyping Last minute coding and many more…
  2. 2. Computer Engineer Bangkok, Thailand Chulalongkorn University
  3. 3. Programming + Soccer Computer Engineer Bangkok, Thailand
  4. 4. Programming + Soccer Computer Engineer Bangkok, Thailand
  5. 5. (P.S. These are actually not my robots, but our competitors’.) Computer Engineer Bangkok, Thailand
  6. 6. Computer Engineer Bangkok, Thailand PhD in Computer Science Information Visualization Univ. of Maryland
  7. 7. Computer Engineer Bangkok, Thailand IBM Microsoft PhD in Computer Science Information Visualization Univ. of Maryland
  8. 8. PhD in Computer Science Information Visualization Univ. of Maryland IBM Microsoft Data Visualization Scientist Twitter Computer Engineer Bangkok, Thailand
  9. 9. #interactive visualizations Open-source projects Visual Analytics Tools
  10. 10. DATA + ME = VIS
  11. 11. Me clients, data, requirements, etc.
  12. 12. WHAT TO EXPECT?
  13. 13. 1. EXPECT POTENTIAL MISMATCHES
  14. 14. INPUT (DATA) What clients think they have
  15. 15. INPUT (DATA) What clients think they have What they usually have
  16. 16. YOU What clients think you are
  17. 17. YOU What clients think you are What they will get
  18. 18. OUTPUT (VIS) What clients ask for
  19. 19. OUTPUT (VIS) What clients ask for What they really need
  20. 20. COMMUNICATE
  21. 21. I need this. Take this.
  22. 22. I need this. Here you are. I need this. Take this.
  23. 23. & COMPROMISE
  24. 24. 2. EXPECT DIFFERENT REQUIREMENTS
  25. 25. DIFFERENT GOALS Present Communicate information effectively Explore Exploratory analysis, Reusable tools for exploration Explore + Present Analyze data + tell story Enjoy More flexible
  26. 26. DIFFERENT GOALS Present Communicate information effectively Explore Exploratory analysis, Reusable tools for exploration Explore + Present Analyze data + tell story Enjoy More flexible
  27. 27. 3. EXPECT TO CLEAN DATA
  28. 28. DATA SOURCES Open data Publicly available Internal data Private, owned by clients’ organization Self-collected data Manual, site scraping, etc. Combine the above
  29. 29. MANY FORMS OF DATA Standalone files: txt, csv, tsv, json, Google Docs, …, pdf*. APIs: better quality with more overhead. Databases: doesn’t necessarily mean they are organized. Big data: bigger pain.
  30. 30. HAVING ALL TWEETS How people think I feel.
  31. 31. How people think I feel. How I really feel. HAVING ALL TWEETS
  32. 32. CHALLENGES Get relevant Tweets hashtag: #oscars keywords: “spotlight” (movie name) Too big Need to aggregate & reduce size Slow Long processing time (hours)
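A minimal sketch of the "get relevant Tweets" and "aggregate & reduce size" steps, written for a small sample that fits in memory rather than for Hadoop. The Tweet shape (`text`, `createdAt` as an ISO-8601 string) and both function names are assumptions made for illustration.

```javascript
// Hypothetical Tweet shape: { text: string, createdAt: ISO-8601 string }.
// A Tweet is relevant if it contains any hashtag or keyword (case-insensitive).
function isRelevant(tweet, hashtags, keywords) {
  const text = tweet.text.toLowerCase();
  return hashtags.some(h => text.includes(h.toLowerCase()))
    || keywords.some(k => text.includes(k.toLowerCase()));
}

// Reduce size: keep one count per hour instead of every Tweet.
function countPerHour(tweets) {
  const counts = new Map();
  for (const tweet of tweets) {
    const hour = tweet.createdAt.slice(0, 13); // e.g. "2016-02-28T20"
    counts.set(hour, (counts.get(hour) || 0) + 1);
  }
  return counts;
}
```

Aggregating early is what makes the later "your laptop" stage feasible: the output is one row per hour, not one row per Tweet.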
  33. 33. Hadoop Cluster GETTING BIG DATA Data Storage
  34. 34. Pig / Scalding (slow) GETTING BIG DATA Hadoop Cluster Data Storage Tool
  35. 35. Hadoop Cluster Pig / Scalding (slow) GETTING BIG DATA Data Storage Tool
  36. 36. Pig / Scalding (slow) GETTING BIG DATA Hadoop Cluster Data Storage Tool Your laptop Smaller dataset
  37. 37. Hadoop Cluster Pig / Scalding (slow) Data Storage Tool Final dataset Tool node.js / python / excel (fast) Your laptop GETTING BIG DATA Smaller dataset
  38. 38. CLEANING Formats: data come in different formats, e.g. tsv to json. Quality of data collection: null, missing data, typos, timestamp. Filter: remove unnecessary data. Conversion: change country code from 3-letter (USA) to 2-letter (US); correct time of day based on users’ timezone; convert lat/lon to county; etc.
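The cleaning steps on this slide could be sketched as a chain of small functions. This is only an illustration under stated assumptions: the column name `country`, the three-entry lookup table, and all function names are hypothetical, and a real project would use a complete ISO 3166 table.

```javascript
// Tiny illustrative lookup (assumption); real data needs the full ISO 3166 table.
const ALPHA3_TO_ALPHA2 = { USA: 'US', THA: 'TH', GBR: 'GB' };

// Format: parse tab-separated text (first line is the header) into objects.
function parseTsv(text) {
  const [header, ...rows] = text.trim().split('\n');
  const columns = header.split('\t');
  return rows.map(row => {
    const cells = row.split('\t');
    const record = {};
    columns.forEach((name, i) => { record[name] = cells[i]; });
    return record;
  });
}

// Quality: treat empty cells and the literal string "null" as missing.
function hasCountry(record) {
  return record.country !== undefined
    && record.country !== ''
    && record.country !== 'null';
}

// Conversion: 3-letter to 2-letter code; unknown codes are left untouched.
function toAlpha2(record) {
  const country = ALPHA3_TO_ALPHA2[record.country] || record.country;
  return Object.assign({}, record, { country });
}

function clean(text) {
  return parseTsv(text).filter(hasCountry).map(toAlpha2);
}
```

Keeping each step as its own function makes it cheap to re-run or swap a single step when a new data issue shows up.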
  39. 39. 4. EXPECT TO CLEAN DATA A LOT
  40. 40. 70-80% of time cleaning data “DATA JANITOR”
  41. 41. WHY? The definition of “clean” depends on the task, e.g. restaurant reviews.
  42. 42.
      USER  RESTAURANT  RATING
      ========================
      A     MCDONALD’S  3
      B     MCDONALDS   3
      C     MCDONALD    4
      D     MCDONALDS   5
      E     IHOP        4
      F     SUBWAY      4
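To see why "clean" depends on the task: counting reviews per user works on this table as-is, but averaging ratings per restaurant needs the three McDonald's spellings merged first. A minimal sketch, assuming a hand-maintained alias table (`CANONICAL` and both function names are hypothetical):

```javascript
// Hypothetical alias table: maps a known variant to one canonical name.
const CANONICAL = { MCDONALD: 'MCDONALDS' };

// Uppercase, drop straight and curly apostrophes, then apply the alias table.
function normalizeName(name) {
  const key = name.toUpperCase().replace(/['’]/g, '').trim();
  return CANONICAL[key] || key;
}

// Average rating per normalized restaurant name.
function averageRatings(reviews) {
  const totals = new Map();
  for (const { restaurant, rating } of reviews) {
    const name = normalizeName(restaurant);
    const t = totals.get(name) || { sum: 0, n: 0 };
    totals.set(name, { sum: t.sum + rating, n: t.n + 1 });
  }
  return new Map([...totals].map(([name, t]) => [name, t.sum / t.n]));
}
```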
  43. 43. WHY? The definition of “clean” depends on the task, e.g. restaurant reviews. Data issues can surface at any time in the project timeline.
  44. 44. RAMSAY & RAMSEY
  45. 45. WHY? The definition of “clean” depends on the task, e.g. restaurant reviews. Data issues can surface at any time in the project timeline. It takes time to process data: Run. Wait… Oops! Re-run. Wait…
  46. 46. RECOMMENDATIONS Always assume you will have to do it again: document the process, automate. Reusable scripts: break a gigantic do-it-all function into smaller ones. Reusable data: keep it for future projects.
  47. 47. 5. EXPECT TO TRY AND BREAK THINGS
  48. 48. https://twitter.com/hashtag/d3brokeandmadeart #D3BROKEANDMADEART
  49. 49. 6. EXPECT TO ITERATE UNTIL IT WORKS
  50. 50. 7. EXPECT DEADLINE
  51. 51. EXAMPLE PROJECTS
  52. 52. EXAMPLE 1: STORYTELLING
  53. 53. WHAT TO EXPECT Timely: the deadline is strict, and unexpected events can also come up. Wide audience: easy to explain and understand, multi-device support. One-off projects. Content screening.
  54. 54. from fans’ conversations Reveal the talking points of every episode of
  55. 55. Problem is coming. CHAPTER I
  56. 56. Problem Want to know what the audience says about a TV show, based on Tweets
  57. 57. HBO’s Game of Thrones Based on a book series “A Song of Ice and Fire” Medieval Fantasy. Knights, magic and dragons.
  58. 58. Brief Story
  59. 59. A King dies.  A lot of contenders wage a war to reclaim the throne.
  60. 60. Minor characters with no claim to the throne set their own plans in action to gain power when all the major characters end up killing each other.
  61. 61. Brave/Honest/Honorable characters die. Intelligent but shady characters and characters who know nothing continue to live.
  62. 62. While humans are busy killing each other, ice zombies (“White Walkers”) are invading from the North. The only group that seems to care about this is a neutral group called the Night’s Watch.
  63. 63. HBO’s Game of Thrones Based on a book series “A Song of Ice and Fire” Medieval Fantasy. Knights, magic and dragons. Many characters. Anybody can die. 6 seasons (60 episodes) so far Multiple storylines in each episode
  64. 64. Problem Want to know what the audience says about a TV show, based on Tweets
  65. 65. Ideas Common words Too much noise
  66. 66. Ideas Common words Too much noise Characters How often was each character mentioned?
  67. 67. I demand a trial by prototyping. CHAPTER II
  68. 68. Prototyping Pull sample data from Twitter API Entity recognition and counting naive approach
  69. 69. List of names (aliases follow a comma): Daenerys Targaryen,Khaleesi; Jon Snow; Sansa Stark; Tyrion Lannister; Arya Stark; Cersei Lannister; Khal Drogo; Gregor Clegane,Mountain; Margaery Tyrell; Joffrey Baratheon; Bran Stark; Theon Greyjoy; Jaime Lannister; Brienne; Eddard Stark,Ned Stark; Ramsay Bolton; Sandor Clegane,Hound; Ygritte; Stannis Baratheon; Petyr Baelish,Little Finger; Robb Stark; Bronn; Varys; Catelyn Stark; Oberyn Martell; Daario Naharis; Davos Seaworth; Jorah Mormont; Melisandre; Myrcella Baratheon; Tywin Lannister; Tommen Baratheon; Grey Worm; Tyene Sand; Rickon Stark; Missandei; Roose Bolton; Robert Baratheon; Jojen Reed; Jeor Mormont; Tormund Giantsbane; Lysa Arryn; Yara Greyjoy,Asha Greyjoy; Samwell Tarly,Sam; Hodor; Victarion Greyjoy; High Sparrow; Dragon; Winter; Dothraki
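The naive "entity recognition and counting" step could look roughly like the sketch below. It assumes each entry is a canonical name followed by its aliases and uses plain case-insensitive substring matching; the function name and the three-entry sample list are illustrative only, not the code used for the project.

```javascript
// Each entry: canonical name first, then aliases (a small illustrative subset).
const CHARACTERS = [
  ['Daenerys Targaryen', 'Khaleesi'],
  ['Jon Snow'],
  ['Hodor'],
];

// Naive matching: case-insensitive substring search, at most one count
// per character per Tweet, credited to the canonical name.
function countMentions(tweets, characters) {
  const counts = new Map(characters.map(names => [names[0], 0]));
  for (const text of tweets) {
    const lower = text.toLowerCase();
    for (const names of characters) {
      if (names.some(n => lower.includes(n.toLowerCase()))) {
        counts.set(names[0], counts.get(names[0]) + 1);
      }
    }
  }
  return counts;
}
```

Substring matching is what makes this naive: a short alias such as "Sam" would also match "same", so short aliases likely need word-boundary checks in practice.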
  70. 70. Sample Tweet
  71. 71. Sample Tweet
  72. 72. Sample data
      Character   Count
      Hodor       10000
      Jon Snow    5000
      Daenerys    4000
      Bran Stark  3000
      …           …
      *These numbers are made up for presentation, not real data.
  73. 73. When you play the game of vis, you iterate or you die. CHAPTER III
  74. 74. Where to go from here?
  75. 75. + episodes The Guardian & Google Trends
 http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode
  76. 76. + emotion
  77. 77. + connections
  78. 78. + connections
  79. 79. Gain insights from a single episode emotion & connections
  80. 80. Sample data
      INDIVIDUALS (+ top emojis)
      Character         Count
      Hodor             10000
      Jon Snow          5000
      Daenerys          4000
      …                 …
      CONNECTIONS (+ top emojis)
      Character         Count
      Jon Snow+Sansa    1000
      Tormund+Brienne   500
      Bran Stark+Hodor  300
      …                 …
      *These numbers are made up for presentation, not real data.
  81. 81. Graph
      NODES (+ top emojis)
      Character         Count
      Hodor             1000
      Jon Snow          500
      Daenerys          400
      …                 …
      LINKS (+ top emojis)
      Character         Count
      Jon Snow+Sansa    1000
      Tormund+Brienne   500
      Bran Stark+Hodor  300
      …                 …
      *These numbers are made up for presentation, not real data.
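One way to derive the node and link tables is to count, per Tweet, each mentioned character once and each pair of co-mentioned characters once. The sketch below assumes the per-Tweet mention lists already exist and that names never contain "+"; `buildGraph` and the object shapes are illustrative, not taken from the project.

```javascript
// mentionsPerTweet: for each Tweet, the list of characters found in it.
function buildGraph(mentionsPerTweet) {
  const nodeCounts = new Map();
  const linkCounts = new Map();
  for (const mentions of mentionsPerTweet) {
    // De-duplicate and sort so that A+B and B+A share one key.
    const names = [...new Set(mentions)].sort();
    for (const name of names) {
      nodeCounts.set(name, (nodeCounts.get(name) || 0) + 1);
    }
    for (let i = 0; i < names.length; i++) {
      for (let j = i + 1; j < names.length; j++) {
        const key = `${names[i]}+${names[j]}`;
        linkCounts.set(key, (linkCounts.get(key) || 0) + 1);
      }
    }
  }
  const nodes = [...nodeCounts].map(([id, count]) => ({ id, count }));
  const links = [...linkCounts].map(([key, count]) => {
    const [source, target] = key.split('+');
    return { source, target, count };
  });
  return { nodes, links };
}
```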
  82. 82. Network Visualization Node-link diagram Force-directed layout http://blockbuilder.org/kristw/762b680690e4b2b2666dfec15838a384
  83. 83. Issue: Hairball
  84. 84. Why?
      Too many nodes & edges:
        nodes = nodes.filter(n => n.count > 100)
        links = links.filter(l => l.count > 100)
      The force is (too) strong:
        force .charge(…) .gravity(…) .linkDistance(…) .linkStrength(…)
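When nodes and links are thresholded independently, a link can survive while one of its endpoints is filtered out. The sketch below adds that check; it is an assumption beyond what the slide shows, and it assumes links refer to nodes by an `id` string rather than by object or index.

```javascript
// Keep nodes and links above minCount, and drop links whose endpoints were removed.
function filterGraph(nodes, links, minCount) {
  const keptNodes = nodes.filter(n => n.count > minCount);
  const keptIds = new Set(keptNodes.map(n => n.id));
  const keptLinks = links.filter(l =>
    l.count > minCount && keptIds.has(l.source) && keptIds.has(l.target));
  return { nodes: keptNodes, links: keptLinks };
}
```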
  85. 85. Issue: Occlusions
  86. 86. Tried: Fixed positions
  87. 87. + Collision Detection http://blockbuilder.org/kristw/2850f65d6329c5fef6d5c9118f1de6e6
  88. 88. + Community Detection https://github.com/upphiminn/jLouvain
  89. 89. + Collision Detection (with clusters) https://bl.ocks.org/mbostock/7881887
  90. 90. Tormund + Brienne
  91. 91. Issue: Convex hull http://bl.ocks.org/mbostock/4341699 d3.geom.hull(vertices)
  92. 92. x & y only, no radius
  93. 93. Example
  94. 94. Fix it
  95. 95. Fix it
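The slides do not show how the hull was fixed. One possible workaround, sketched here as an assumption, is to replace each circle with sample points on its circumference before computing the hull, so the outline clears the circles instead of cutting through their centers. The hull routine is a plain monotone-chain implementation standing in for `d3.geom.hull`; all function names are hypothetical.

```javascript
// Replace a circle {x, y, r} with `samples` points on its circumference.
function circleToPoints(circle, samples) {
  const points = [];
  for (let i = 0; i < samples; i++) {
    const angle = (2 * Math.PI * i) / samples;
    points.push([
      circle.x + circle.r * Math.cos(angle),
      circle.y + circle.r * Math.sin(angle),
    ]);
  }
  return points;
}

// Cross product of vectors o->a and o->b; positive means a left turn.
function cross(o, a, b) {
  return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
}

// Andrew's monotone chain convex hull over [x, y] points.
function convexHull(points) {
  const sorted = points.slice().sort((a, b) => a[0] - b[0] || a[1] - b[1]);
  if (sorted.length < 3) return sorted;
  const build = pts => {
    const chain = [];
    for (const p of pts) {
      while (chain.length >= 2
        && cross(chain[chain.length - 2], chain[chain.length - 1], p) <= 0) {
        chain.pop();
      }
      chain.push(p);
    }
    chain.pop(); // last point is the first point of the other chain
    return chain;
  };
  const lower = build(sorted);
  const upper = build(sorted.slice().reverse());
  return lower.concat(upper);
}

// Hull around circles: the outline accounts for radius, not just centers.
function hullAroundCircles(circles, samples = 16) {
  const points = [];
  for (const c of circles) points.push(...circleToPoints(c, samples));
  return convexHull(points);
}
```

More samples per circle give a rounder outline at the cost of more hull vertices.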
  96. 96. Let’s get other episodes.
  97. 97. Hadoop remembers. CHAPTER IV
  98. 98. More data Hadoop Rewrite the scripts in Scalding to get archived data
  99. 99. How much data do we need? Whole week? 5 days? 2 days? A day? etc.
  100. 100. How much data do we need?
  101. 101. Transitions
  102. 102. not so smooth
  103. 103. After switching episode 1. Store old positions for existing objects. 2. Assign new initial positions.*
  104. 104. Initial positions Default: random Better starting points Heuristics based on degree of nodes
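The slide does not spell out the degree-based heuristic. One guess, sketched below purely as an assumption, is to start well-connected nodes near the center and poorly connected ones near the edge. It assumes links refer to existing nodes by `id`; `initialPositions` is a hypothetical name.

```javascript
// Place nodes on a circle whose radius shrinks as the node's degree grows,
// so the highest-degree node starts at the center of the canvas.
function initialPositions(nodes, links, width, height) {
  const degree = new Map(nodes.map(n => [n.id, 0]));
  for (const l of links) {
    degree.set(l.source, degree.get(l.source) + 1);
    degree.set(l.target, degree.get(l.target) + 1);
  }
  const maxDegree = Math.max(1, ...degree.values());
  const maxRadius = Math.min(width, height) / 2;
  return nodes.map((n, i) => {
    const angle = (2 * Math.PI * i) / nodes.length;
    const radius = maxRadius * (1 - degree.get(n.id) / maxDegree);
    return Object.assign({}, n, {
      x: width / 2 + radius * Math.cos(angle),
      y: height / 2 + radius * Math.sin(angle),
    });
  });
}
```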
  105. 105. After switching episode 1. Store old positions for existing objects. 2. Assign new initial positions.* 3. Run simulation without updating <svg> for n rounds. 4. Animate objects from old to new positions. 5. Resume simulation and update <svg> every tick.
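Steps 1 to 4 can be sketched without any DOM code. Here `simulation` is a hypothetical object whose `tick()` moves the nodes in place (with D3's force layout, calling its tick method in a loop plays the same role), and the returned plan is what the caller would animate before resuming per-tick rendering in step 5. Starting existing nodes from their old positions is an assumption.

```javascript
// Returns, for every node in the new episode, where to animate from and to.
function planTransition(oldNodes, newNodes, simulation, warmupTicks) {
  // 1. Store old positions for objects that already exist.
  const oldPositions = new Map(oldNodes.map(n => [n.id, { x: n.x, y: n.y }]));

  // 2. Assign initial positions: existing nodes start where they were
  //    (assumption); new nodes keep whatever position the caller gave them.
  for (const node of newNodes) {
    const old = oldPositions.get(node.id);
    if (old) { node.x = old.x; node.y = old.y; }
  }

  // 3. Run the simulation for a fixed number of rounds without rendering.
  for (let i = 0; i < warmupTicks; i++) simulation.tick();

  // 4. Describe the animation; brand-new nodes have from === to,
  //    so they simply appear (or fade in) at their settled position.
  return newNodes.map(node => ({
    id: node.id,
    from: oldPositions.get(node.id) || { x: node.x, y: node.y },
    to: { x: node.x, y: node.y },
  }));
}
```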
  106. 106. Animate Nodes & Links Remove delay Move & Change size/thickness Add new
  107. 107.
      // Create selection
      const selection = svg.selectAll('g.node')
        .data(nodes, d => d.entity.id);
      // Fade “exit” nodes to opacity 0 and remove
      selection.exit()
        .transition()
        .duration(1000)
        .style('opacity', 0)
        .remove();
      // Add “enter” nodes with opacity 0
      const sEnter = selection.enter().append('g')
        .classed('node', true)
        .attr('transform', d => `translate(${d.x},${d.y})`)
        .style('opacity', 0)
        .call(force.drag);
      sEnter.append('circle')
        .attr('r', d => d.r)
        .style('fill', d => options.colorScale(d.entity.group));
      // After 1s delay, use transition to move nodes and fade in new nodes
      const sTrans = selection.transition()
        .delay(1000)
        .duration(2000)
        .attr('transform', d => `translate(${d.x},${d.y})`)
        .style('opacity', 1);
      sTrans.select('circle')
        .attr('r', d => d.r);
  108. 108. Animate Communities Remove delay Move & Change shape* Add new http://blockbuilder.org/kristw/f9ffe87dd8b4038b5867e853c27cebb7
  109. 109. Default t=0 t=1
  110. 110. Smoother t=0 t=0.5 t=0.51 t=1
  111. 111. Code
      // original
      path.attr('d', hull);
      // with custom interpolation
      path.attrTween('d', (d, i, currentAttr) => interpolateHull(d, currentAttr));
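The body of `interpolateHull` is not shown on the slide. As a stand-in, the sketch below interpolates between two polygons given as arrays of `[x, y]` vertices (not path strings, and both non-empty) by padding the shorter one so that vertices can be paired up. The names `padTo`, `toPath` and `interpolatePolygon` are hypothetical.

```javascript
// Pad the shorter vertex list by repeating its last vertex.
function padTo(points, length) {
  const padded = points.slice();
  while (padded.length < length) padded.push(points[points.length - 1]);
  return padded;
}

// Serialize vertices as a closed SVG path.
function toPath(points) {
  return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
}

// Returns a function t -> path string, the shape a tween factory must return.
function interpolatePolygon(from, to) {
  const n = Math.max(from.length, to.length);
  const a = padTo(from, n);
  const b = padTo(to, n);
  return t => toPath(a.map((p, i) => [
    p[0] + (b[i][0] - p[0]) * t,
    p[1] + (b[i][1] - p[1]) * t,
  ]));
}
```

Pairing vertices by index is the simplest choice; matching each vertex to its nearest counterpart would likely look smoother when the hull rotates.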
  112. 112. Colors Default: d3.category10() Distinct but nothing about the context Custom palette Colors related to the groups/houses. Black = Night’s Watch Blue = North Red = Daenerys Gold = Lannister …
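A custom palette can be as small as a lookup table with a fallback. The group keys mirror the slide; the CSS color names and the gray fallback for unlisted groups are assumptions.

```javascript
// Group-to-color lookup taken from the slide; exact shades are assumptions.
const HOUSE_COLORS = {
  "Night's Watch": 'black',
  North: 'blue',
  Daenerys: 'red',
  Lannister: 'gold',
};

// Assumption: groups without a dedicated color fall back to gray.
const FALLBACK_COLOR = 'gray';

function colorFor(group) {
  return HOUSE_COLORS[group] || FALLBACK_COLOR;
}
```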
  113. 113. Hold the vis. CHAPTER V
  114. 114. The vis is not enough.
  115. 115. Legend
  116. 116. Navigation
  117. 117. Top 3
  118. 118. Adjust threshold
  119. 119. Recap
  120. 120. Filtered Recap Tooltip
  121. 121. Demo https://interactive.twitter.com/game-of-thrones
  122. 122. Mobile Support
  123. 123. A visualizer always evaluates his work. CHAPTER VI
  124. 124. “Feedback is the breakfast of champions.” - Ken Blanchard
  125. 125. Self & Peer Does it solve the problem?
  126. 126. Google Analytics Pageviews Visitors Actions Referrals Sites/Social
  127. 127. Feedback
  128. 128. Feedback
  129. 129. EXAMPLE 2: VISUAL ANALYTICS TOOLS
  130. 130. Data sources Output explore analyze present get * *
  131. 131. Data sources Output explore analyze present get * * ad-hoc scripts
  132. 132. Data sources Output explore analyze present get * * ad-hoc scripts tools for exploration
  133. 133. WHAT TO EXPECT Richer: more features to support exploration of complex data. More technical audience: product managers, engineers, data scientists. Accuracy. Designed for dynamic input. Long-term projects.
  134. 134. USER ACTIVITY LOGS
  135. 135. Users Use Twitter
  136. 136. Users Use Product Managers Curious Twitter
  137. 137. Users Use Curious Engineers Log data in Hadoop Write Twitter Instrument Product Managers
  138. 138. WHAT IS BEING LOGGED? Activities: tweet
  139. 139. WHAT IS BEING LOGGED? Activities: tweet from home timeline on twitter.com, tweet from search page on iPhone
  140. 140. WHAT IS BEING LOGGED? Activities: tweet from home timeline on twitter.com, tweet from search page on iPhone, sign up, log in, retweet, etc.
  141. 141. ORGANIZE?
  142. 142. LOG EVENT A.K.A. “CLIENT EVENT” [Lee et al. 2012]
  143. 143. LOG EVENT A.K.A. “CLIENT EVENT” Name format: client : page : section : component : element : action. Example: web : home : timeline : tweet_box : button : tweet. Fields: 1) User ID 2) Timestamp 3) Event name 4) Event detail [Lee et al. 2012]
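The six-part event name lends itself to a small parser. This is a minimal sketch: the field names come from the slide, while `parseEventName` and the error handling are assumptions.

```javascript
// The six components of a client event name, in order.
const FIELDS = ['client', 'page', 'section', 'component', 'element', 'action'];

// Split "client:page:section:component:element:action" into named fields.
function parseEventName(name) {
  const parts = name.split(':').map(s => s.trim());
  if (parts.length !== FIELDS.length) {
    throw new Error(`Expected ${FIELDS.length} components, got ${parts.length}: ${name}`);
  }
  const event = {};
  FIELDS.forEach((field, i) => { event[field] = parts[i]; });
  return event;
}
```

For example, `parseEventName('web:home:timeline:tweet_box:button:tweet')` yields an object whose `client` is `'web'` and whose `action` is `'tweet'`.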
  144. 144. LOG DATA
  145. 145. Users Use Curious Engineers Log data in Hadoop Twitter Instrument Write Product Managers bigger than Tweet data
  146. 146. Users Use Curious Engineers Log data in Hadoop Data Scientists Ask Twitter Instrument Write Product Managers
  147. 147. Users Use Curious Engineers Log data in Hadoop Data Scientists Find Ask Twitter Instrument Write Product Managers
  148. 148. LOG DATA
  149. 149. Users Use Curious Engineers Log data in Hadoop Data Scientists Find, Clean Ask Twitter Instrument Write Product Managers
  150. 150. Users Use Curious Engineers Log data in Hadoop Data Scientists Find, Clean Ask Monitor Twitter Instrument Write Product Managers
  151. 151. Users Use Curious Engineers Log data in Hadoop Data Scientists Find, Clean, Analyze Ask Monitor Twitter Instrument Write Product Managers
  152. 152. Log data Engineers Data Scientists Users in Hadoop Find, Clean, Analyze Use Monitor Ask Curious 1 2 Twitter Instrument Write Product Managers
  153. 153. Scribe Radar Project / Find & Monitor client events
  154. 154. GOALS Search for client events Explore client event collection Monitor changes
  155. 155. CLIENT EVENT HIERARCHY iphone home - - - impression tweet tweet click iphone:home:-:-:-:impression iphone:home:-:tweet:tweet:click
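Because every event name has the same six levels, a collection of events can be folded into a tree with counts rolled up at each level. A minimal sketch, assuming the input is an object mapping event name to count; `buildEventTree` and the node shape are hypothetical.

```javascript
// Fold { "a:b:c:d:e:f": count } into a tree; each node's count is the
// total of all events underneath it.
function buildEventTree(eventCounts) {
  const root = { name: 'root', count: 0, children: new Map() };
  for (const [eventName, count] of Object.entries(eventCounts)) {
    let node = root;
    node.count += count;
    for (const part of eventName.split(':')) {
      if (!node.children.has(part)) {
        node.children.set(part, { name: part, count: 0, children: new Map() });
      }
      node = node.children.get(part);
      node.count += count;
    }
  }
  return root;
}
```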
  156. 156. DETECT CHANGES iphone home - - - impression tweet tweet click iphone home - - - impression tweet tweet click TODAY 7 DAYS AGO compared to
  157. 157. CALCULATE CHANGES +5% +5% +5% +10% +10% +10% -5% -5% -5% DIFF
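The change calculation can be sketched as a percent difference between today's count and the count from 7 days ago, per event name. How Scribe Radar treats events missing on one side is not stated; treating them as zero, and a jump from zero as `Infinity`, are assumptions here.

```javascript
// Percent change relative to the count from 7 days ago.
function percentChange(today, weekAgo) {
  if (weekAgo === 0) return today === 0 ? 0 : Infinity;
  return ((today - weekAgo) * 100) / weekAgo;
}

// Compare two { eventName: count } objects; missing events count as 0.
function diffCounts(todayCounts, weekAgoCounts) {
  const names = new Set([
    ...Object.keys(todayCounts),
    ...Object.keys(weekAgoCounts),
  ]);
  const diff = {};
  for (const name of names) {
    diff[name] = percentChange(todayCounts[name] || 0, weekAgoCounts[name] || 0);
  }
  return diff;
}
```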
  158. 158. DISPLAY CHANGES iphone home - - - impression tweet tweet click Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
  159. 159. DISPLAY CHANGES home - - - impression tweet tweet click iphone
  160. 160. Demo / Scribe Radar
  161. 161. Twitter for Banana
  162. 162. WORKFLOW Requested / Identify needs. Design & Prototype: make it work for sample dataset. Refine & Generalize. Productionize. Document & Release. Maintain & Support: keep it running, feature requests & bug fixes.
  163. 163. 8. EXPECT TO REFINE AND POLISH
  164. 164. REFINE & POLISH UX / UI Color Animation Mobile support Performance Loading time, Data file size “The little of visualisation design” by Andy Kirk http://www.visualisingdata.com/2016/03/little-visualisation-design/
  165. 165. 9. EXPECT TO GET FEEDBACK
  166. 166. FEEDBACK Logging User study Forum, User group Office hours
  167. 167. 10. EXPECT TO IMPROVE
  168. 168. HOW TO BE BETTER? Time is limited.
 Grow the team Expand skills Improve tooling Solve a problem once and for all Automate repetitive tasks
  169. 169. http://twitter.github.io/labella.js Demo / Labella.js
  170. 170. https://github.com/twitter/d3kit Demo / d3Kit http://www.slideshare.net/kristw/d3kit
  171. 171. yeoman.io Demo / Yeoman
  172. 172. SUMMARY
  173. 173. INPUT YOU OUTPUT
  174. 174. EXPECT 1) potential mismatches 2) different requirements 3) to clean data 4) to clean data a lot 5) to try and break things 6) to iterate until it works 7) deadline 8) to refine and polish 9) to get feedback 10) to improve Krist Wongsuphasawat / @kristw kristw.yellowpigz.com
  175. 175. #VOTE
  176. 176. Nicolas Garcia Belmonte, Robert Harris, Miguel Rios, Simon Rogers, Jimmy Lin, Linus Lee, Chuang Liu, and many colleagues at Twitter. ACKNOWLEDGEMENT
  177. 177. RESOURCES Images Banana phone http://goo.gl/GmcMPq Bar chart https://goo.gl/1G1GBg Boss https://goo.gl/gcY8Kw Champions League http://goo.gl/DjtNKE Database http://goo.gl/5N7zZz Fishing shark http://goo.gl/2fp4zW Globe visualization http://goo.gl/UiGMMj Harry Potter http://goo.gl/Q9Cy64 Holding phone http://goo.gl/It2TzH Kiwi orange http://goo.gl/ejQ73y Kiwi http://goo.gl/9yk7o5 Library https://goo.gl/HVeE6h Library earthquake http://goo.gl/rBqBrs Minion http://goo.gl/I19Ijg NBA http://goo.gl/p7HBdG NFL http://goo.gl/feQMZs Orange & Apple http://goo.gl/NG6RIL Pile of paper http://goo.gl/mGLQTx Premier League http://goo.gl/AqIINO Scrooge McDuck https://goo.gl/aKv8D7 The Sound of Music https://goo.gl/dqHlzj Trash pile http://goo.gl/OsFfo3 Tyrion http://goo.gl/WaBonl Watercolor Map by Stamen Design
  178. 178. THANK YOU
  179. 179. QUESTIONS?
