
6 things to expect when you are visualizing (2020 Edition)


This talk was prepared as a note to my future self for future projects. I reflect on the tasks commonly involved in crafting visualizations, point out common things to expect and pitfalls to avoid, and provide recommendations. Along the way, I include examples of different applications of information/data visualization and details on how each project was started and developed.

These slides are from my (remote) guest lecture in the InfoVis class at the UC Berkeley iSchool on Apr 8, 2020, during the COVID-19 shelter-in-place. Thank you, Prof. Marti Hearst, for the invitation.

Published in: Data & Analytics

  1. 1. Krist Wongsuphasawat / @kristw 6 THINGS TO EXPECT WHEN YOU ARE VISUALIZING
  2. 2. 6 THINGS TO EXPECT WHEN YOU ARE VISUALIZING Krist Wongsuphasawat / @kristw
  3. 3. Computer Engineer Bangkok, Thailand Chulalongkorn University Krist Wongsuphasawat / @kristw
  4. 4. Programming + Soccer Computer Engineer Bangkok, Thailand Krist Wongsuphasawat / @kristw
  5. 5. Programming + Soccer Computer Engineer Bangkok, Thailand Krist Wongsuphasawat / @kristw
  6. 6. (P.S. These are actually not my robots, but our competitors’.) Krist Wongsuphasawat / @kristw Computer Engineer Bangkok, Thailand
  7. 7. Krist Wongsuphasawat / @kristw; Computer Engineer, Bangkok, Thailand; PhD in Computer Science (Information Visualization), Univ. of Maryland
  8. 8. Krist Wongsuphasawat / @kristw; Computer Engineer, Bangkok, Thailand; PhD in Computer Science (Information Visualization), Univ. of Maryland; IBM; Microsoft
  9. 9. Krist Wongsuphasawat / @kristw; Computer Engineer, Bangkok, Thailand; PhD in Computer Science (Information Visualization), Univ. of Maryland; IBM; Microsoft; Data Scientist (Analytics, Experiment), Twitter
  10. 10. Krist Wongsuphasawat / @kristw; Computer Engineer, Bangkok, Thailand; PhD in Computer Science (Information Visualization), Univ. of Maryland; IBM; Microsoft; Twitter; Engineering Manager (Data Experience), Airbnb
  11. 11. #interactive visualizations: interactive.twitter.com. Open-source projects: Apache Superset committer, labella.js (3000+ stars), react-vega. Visual Analytics Tools: internal tools, academic papers. kristw.yellowpigz.com
  12. 12. ME = DATA + VIS
  13. 13. Data, I’m ready!
  14. 14. Data, I’m ready! Here I come!
  15. 15. WHAT TO EXPECT?
  16. 16. 1. EXPECT TO FIND THE REAL NEED
  17. 17. INPUT (DATA) What clients think they have
  18. 18. INPUT (DATA) What clients think they have What they usually have
  19. 19. YOU What clients think you are
  20. 20. YOU What clients think you are What they will get
  21. 21. OUTPUT (VIS) What clients ask for
  22. 22. OUTPUT (VIS) What clients ask for What they really need
  23. 23. COMMUNICATE
  24. 24. GOALS Present data: communicate information effectively. Analyze data: exploratory data analysis. Tools to analyze data: reusable tools for exploration. Enjoy. Combination of above.
  25. 25. GOALS Present data: communicate information effectively (Who is the audience? What do you want to tell?). Analyze data: exploratory data analysis (What are the questions?). Tools to analyze data: reusable tools for exploration (Who will use this? What would they use this for?). Enjoy (Who is the audience?). Combination of above.
  26. 26. I need this. Take this.
  27. 27. I need this. Here you are. I need this. Take this.
  28. 28. & COMPROMISE
  29. 29. 2. EXPECT TO CLEAN DATA
  30. 30. 2. EXPECT TO CLEAN DATA A LOT
  31. 31. 70-80% of time cleaning data “DATA JANITOR”
  32. 32. Collect + Clean + Transform DATA WRANGLING
  33. 33. WHY DOES IT TAKE SO MUCH TIME?
  34. 34. 2.1 Many sources and data formats
  35. 35. DATA SOURCES Open data: publicly available. Internal data: private, owned by clients’ organization. Self-collected data: manual, site scraping, etc. Combine the above.
  36. 36. DATA FORMAT Standalone files: txt, csv, tsv, json, Google Docs, …, pdf*. Databases: doesn’t necessarily mean they are organized. API: better quality with more overhead. Website. Big data*.
  37. 37. NEED TO… Change format, e.g. tsv => json. Combine data. Resolve multiple sources of truth.
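A minimal sketch of the "tsv => json" format change, assuming the file has a header row and fits in memory. The function name `tsv_to_json` and the sample rows are made up for illustration.

```python
import csv
import io
import json


def tsv_to_json(tsv_text: str) -> str:
    """Convert tab-separated text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Values stay strings; type conversion is a separate cleaning step.
    return json.dumps(list(reader), indent=2)


# Hypothetical sample rows, made up for illustration.
sample_tsv = "user\trestaurant\trating\nA\tIHOP\t4\nB\tSUBWAY\t4\n"
print(tsv_to_json(sample_tsv))
```

Every value comes out as a string here, which is often the first hint that a transformation step is still needed.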
  38. 38. 2.2 Data transformation is needed.
  39. 39. EXAMPLES Convert latitude/longitude into zip code. Change country code from 3-letter (USA) to 2-letter (US). Correct time of day based on users’ timezone. Etc.
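Two of these transformations can be sketched in a few lines. This assumes a tiny hand-written country lookup (a real project would load the full ISO 3166-1 table) and a fixed UTC offset per user rather than a named timezone; both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Tiny illustrative lookup; a real project would load the full ISO 3166-1 table.
ALPHA3_TO_ALPHA2 = {"USA": "US", "THA": "TH", "GBR": "GB"}


def to_alpha2(code: str) -> str:
    """Map a 3-letter country code to its 2-letter form; raises KeyError if unknown."""
    return ALPHA3_TO_ALPHA2[code.strip().upper()]


def local_hour(utc_timestamp: datetime, utc_offset_hours: float) -> int:
    """Return the hour of day in the user's timezone, given a fixed UTC offset."""
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    return utc_timestamp.astimezone(user_tz).hour


event_time = datetime(2020, 4, 8, 2, 30, tzinfo=timezone.utc)
print(to_alpha2("usa"))            # US
print(local_hour(event_time, -7))  # 19 (the previous evening, 7 hours behind UTC)
```

Failing loudly on an unknown code is a deliberate choice in this sketch: a silent default would hide exactly the kind of data issue the talk warns about.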
  40. 40. 2.3 Data collection issues
  41. 41. EXAMPLES Typos. Incorrect values. Incorrect timestamps. Missing data.
  42. 42. 2.4 Definition of “clean” data
  43. 43. IS THIS CLEAN?
      USER  RESTAURANT  RATING
      ========================
      A     MCDONALD’S  3
      B     MCDONALDS   3
      C     MCDONALD    4
      D     MCDONALDS   5
      E     IHOP        4
      F     SUBWAY      4
  44. 44. IS THIS CLEAN?
      USER  RESTAURANT  RATING
      ========================
      A     MCDONALD’S  3
      B     MCDONALDS   3
      C     MCDONALD    4
      D     MCDONALDS   5
      E     IHOP        4
      F     SUBWAY      4
      How many reviews are there? Clean.
      How many restaurants are there? Not clean. McDonald, McDonald’s, McDonalds
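The point that "clean" depends on the question can be checked directly on this table. The sketch below assumes a hand-maintained alias table (`CANONICAL`, hypothetical) on top of simple punctuation stripping; real name matching usually needs more than this.

```python
import re

reviews = [
    ("A", "MCDONALD’S", 3),
    ("B", "MCDONALDS", 3),
    ("C", "MCDONALD", 4),
    ("D", "MCDONALDS", 5),
    ("E", "IHOP", 4),
    ("F", "SUBWAY", 4),
]

# Hypothetical alias table: maps a normalized spelling to one canonical name.
CANONICAL = {"MCDONALD": "MCDONALDS"}


def normalize(name: str) -> str:
    """Uppercase, drop punctuation such as apostrophes, then apply the alias table."""
    key = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    return CANONICAL.get(key, key)


raw_names = {restaurant for _, restaurant, _ in reviews}
clean_names = {normalize(restaurant) for _, restaurant, _ in reviews}

print(len(reviews))      # 6 reviews: already clean for this question
print(len(raw_names))    # 5 distinct spellings before cleaning
print(len(clean_names))  # 3 restaurants after cleaning
```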
  45. 45. 2.5 Bigger data, bigger problems
  46. 46. HAVING ALL TWEETS How people think I feel.
  47. 47. How people think I feel. How I really feel. HAVING ALL TWEETS
  48. 48. Hadoop Cluster GETTING BIG DATA Data Storage
  49. 49. Scalding (slow) GETTING BIG DATA Hadoop Cluster Data Storage Tool
  50. 50. Scalding (slow) GETTING BIG DATA Hadoop Cluster Data Storage Tool Your laptop Smaller dataset
  51. 51. Hadoop Cluster Scalding (slow) Data Storage Tool Final dataset Tool node.js / python / excel (fast) Your laptop GETTING BIG DATA Smaller dataset
  52. 52. CHALLENGES Slow: long processing time (hours). Get relevant Tweets: hashtag #oscars, keywords “parasite” (movie name). Too big: need to aggregate & reduce size. Harder to spot problems.
  53. 53. CHALLENGES Slow: long processing time (hours). Get relevant Tweets: hashtag #oscars, keywords “parasite” (movie name). Too big: need to aggregate & reduce size. Harder to spot problems.
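The "get relevant Tweets" and "aggregate & reduce size" steps might look like the sketch below once a smaller dataset is on a laptop. The record fields (`created_at`, `text`) and the sample Tweets are assumptions made up for illustration, not the real Tweet schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records; real Tweets carry many more fields.
tweets = [
    {"created_at": "2020-02-10T01:00:12", "text": "Parasite wins! #Oscars"},
    {"created_at": "2020-02-10T01:00:48", "text": "So happy for PARASITE"},
    {"created_at": "2020-02-10T01:01:05", "text": "What a night #oscars"},
    {"created_at": "2020-02-10T01:01:30", "text": "Making dinner"},
]

HASHTAGS = {"#oscars"}
KEYWORDS = {"parasite"}


def is_relevant(text: str) -> bool:
    """Keep a Tweet if it contains a tracked hashtag or keyword (case-insensitive)."""
    tokens = {word.strip(".,!?") for word in text.lower().split()}
    return bool(tokens & (HASHTAGS | KEYWORDS))


def count_per_minute(records):
    """Aggregate relevant Tweets into per-minute counts to shrink the dataset."""
    counts = Counter()
    for record in records:
        if is_relevant(record["text"]):
            minute = datetime.fromisoformat(record["created_at"]).replace(second=0)
            counts[minute.isoformat()] += 1
    return dict(counts)


print(count_per_minute(tweets))
```

A keyword like "parasite" will also match Tweets that have nothing to do with the movie, which is one way problems become harder to spot once the data is aggregated.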
  54. 54. RAMSAY & RAMSEY
  55. 55. 2.6 New issues can show up any time.
  56. 56. RECOMMENDATIONS Always assume that you will have to do it again: document the process and automate it. Reusable scripts: break a gigantic do-it-all function into smaller ones. Reusable data: keep it for future projects.
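As an illustration of breaking a do-it-all function into smaller ones, here is a sketch of a four-step pipeline. The step names and the "name,rating" input format are hypothetical; the point is only that each step can be tested, rerun and reused on its own.

```python
import json
from collections import Counter


def parse_lines(raw_text):
    """Step 1: split raw 'name,rating' lines into (name, rating) tuples."""
    rows = []
    for line in raw_text.strip().splitlines():
        name, rating = line.rsplit(",", 1)
        rows.append((name, int(rating)))
    return rows


def clean_rows(rows):
    """Step 2: normalize names so later steps can group on them."""
    return [(name.strip().upper(), rating) for name, rating in rows]


def summarize(rows):
    """Step 3: count reviews per name."""
    return Counter(name for name, _ in rows)


def export_json(summary):
    """Step 4: serialize the result for the visualization."""
    return json.dumps(summary, sort_keys=True)


def run_pipeline(raw_text):
    """Compose the small steps into one repeatable run."""
    return export_json(summarize(clean_rows(parse_lines(raw_text))))


print(run_pipeline("ihop,4\n IHOP ,5\nsubway,4\n"))
```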
  57. 57. 3. PREPARE TO ITERATE
  58. 58. It was a great idea … until I actually tried it.
  59. 59. Celebrate your failures #D3BrokeAndMadeArt
  60. 60. TIPS Don’t give up. If stuck, look for inspiration. The vis that gives you insights may or may not be the best vis for sharing (exploration vs. communication). Keep it as simple as possible, but not simpler.
  61. 61. “Necessity is the mother of invention.” — English Proverb
  62. 62. “Necessity is the mother of invention.” — English Proverb DEADLINE
  63. 63. TIPS Don’t give up. If stuck, look for inspiration. The vis that gives you insights may or may not be the best vis for sharing (exploration vs. communication). Keep it as simple as possible, but not simpler. Set milestones and deadlines.
  64. 64. PROJECTS
  65. 65. STORYTELLING PROJECTS Timely: the deadline is strict, and there can also be unexpected events. Wide audience: easy to explain and understand, multi-device support. One-off project. Scope: analyze data to find stories and find the best way to present them.
  66. 66. HAPPY NEW YEAR AROUND THE WORLD [ PROJECT ]
  67. 67. HAPPY NEW YEAR 2013 twitter.github.io/interactive/newyear2014/
  68. 68. BOBA SCIENCE [ PROJECT ]
  69. 69. https://medium.com/s/story/boba-science-how-can-i-drink-a-bubble-tea-to-ensure-that-i-dont-finish-the-tea-before-the-bobas-7fc5fd0e442d
  70. 70. GAME OF THRONES [ PROJECT ]
  71. 71. Reveal the talking points of every episode of Game of Thrones from fans’ conversations
  72. 72. Problem is coming.
  73. 73. Problem: want to know, from Tweets, what the audience is talking about for a TV show
  74. 74. HBO’s Game of Thrones. Based on a book series, “A Song of Ice and Fire”. Medieval fantasy. Knights, magic and dragons.
  75. 75. Brief Story
  76. 76. A King dies. A lot of contenders wage a war to reclaim the throne.
  77. 77. Minor characters with no claim to the throne set their own plans in action to gain power when all the major characters end up killing each other.
  78. 78. Brave/Honest/Honorable characters die. Intelligent but shady characters and characters who know nothing continue to live.
  79. 79. While humans are busy killing each other, ice zombies (“White Walkers”) are invading from the North. The only group that seems to care about this is a neutral group called the Night’s Watch.
  80. 80. HBO’s Game of Thrones. Based on a book series, “A Song of Ice and Fire”. Medieval fantasy. Knights, magic and dragons. Many characters. Anybody can die. 8 seasons. Multiple storylines in each episode.
  81. 81. Problem: want to know, from Tweets, what the audience is talking about for a TV show
  82. 82. Ideas Common words: too much noise.
  83. 83. Ideas Common words: too much noise. Characters: how often was each character mentioned?
  84. 84. Prototyping Pull sample data from Twitter API. Entity recognition and counting (naive approach).
  85. 85. List of names (aliases after a comma): Daenerys Targaryen,Khaleesi; Jon Snow; Sansa Stark; Tyrion Lannister; Arya Stark; Cersei Lannister; Khal Drogo; Gregor Clegane,Mountain; Margaery Tyrell; Joffrey Baratheon; Bran Stark; Theon Greyjoy; Jaime Lannister; Brienne; Eddard Stark,Ned Stark; Ramsay Bolton; Sandor Clegane,Hound; Ygritte; Stannis Baratheon; Petyr Baelish,Little Finger; Robb Stark; Bronn; Varys; Catelyn Stark; Oberyn Martell; Daario Naharis; Davos Seaworth; Jorah Mormont; Melisandre; Myrcella Baratheon; Tywin Lannister; Tommen Baratheon; Grey Worm; Tyene Sand; Rickon Stark; Missandei; Roose Bolton; Robert Baratheon; Jojen Reed; Jeor Mormont; Tormund Giantsbane; Lysa Arryn; Yara Greyjoy,Asha Greyjoy; Samwell Tarly,Sam; Hodor; Victarion Greyjoy; High Sparrow; Dragon; Winter; Dothraki
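A naive counting pass over such a name list could be sketched as follows. This is not the script used in the project: the dictionary layout, the word-boundary regex matching and the sample Tweets are all assumptions for illustration.

```python
import re
from collections import Counter

# A few entries from the name list; each canonical name maps to the strings to match.
CHARACTERS = {
    "Daenerys Targaryen": ["Daenerys Targaryen", "Khaleesi"],
    "Jon Snow": ["Jon Snow"],
    "Samwell Tarly": ["Samwell Tarly", "Sam"],
    "Hodor": ["Hodor"],
}


def build_patterns(characters):
    """Compile one case-insensitive, word-bounded regex per character."""
    patterns = {}
    for name, aliases in characters.items():
        alternatives = "|".join(re.escape(alias) for alias in aliases)
        patterns[name] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns


def count_mentions(tweets, patterns):
    """Count each character at most once per Tweet."""
    counts = Counter()
    for text in tweets:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


sample_tweets = [  # made up for illustration
    "HODOR!!! Hold the door",
    "Khaleesi and Jon Snow finally meet",
    "jon snow knows nothing",
    "same old story this week",
]
print(count_mentions(sample_tweets, build_patterns(CHARACTERS)).most_common())
```

The word boundaries keep a short alias such as "Sam" from matching inside "same", but the approach still cannot tell which Sam or which dragon a Tweet means.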
  86. 86. Sample Tweet
  87. 87. Sample Tweet
  88. 88. Sample data
      Character   Count
      Hodor       10000
      Jon Snow    5000
      Daenerys    4000
      Bran Stark  3000
      …           …
      *These numbers are made up for presentation, not real data.
  89. 89. Where to go from here?
  90. 90. + episodes The Guardian & Google Trends http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode
  91. 91. + emotion
  92. 92. + connections
  93. 93. + connections
  94. 94. Gain insights from a single episode emotion & connections
  95. 95. Sample data
      INDIVIDUALS (+ top emojis)
      Character  Count
      Hodor      10000
      Jon Snow   5000
      Daenerys   4000
      …          …
      CONNECTIONS (+ top emojis)
      Character         Count
      Jon Snow+Sansa    1000
      Tormund+Brienne   500
      Bran Stark+Hodor  300
      …                 …
      *These numbers are made up for presentation, not real data.
  96. 96. Graph
      NODES (+ top emojis)
      Character  Count
      Hodor      1000
      Jon Snow   500
      Daenerys   400
      …          …
      LINKS (+ top emojis)
      Character         Count
      Jon Snow+Sansa    1000
      Tormund+Brienne   500
      Bran Stark+Hodor  300
      …                 …
      *These numbers are made up for presentation, not real data.
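One way to derive nodes and links from Tweets is to count co-mentions, as in the sketch below. It assumes each Tweet has already been reduced to the list of characters it mentions; the function name, the output field names and the `min_link_count` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_graph(mentions_per_tweet, min_link_count=1):
    """Turn per-Tweet character lists into node counts and link counts."""
    node_counts = Counter()
    link_counts = Counter()
    for characters in mentions_per_tweet:
        unique = sorted(set(characters))
        node_counts.update(unique)
        # Every unordered pair mentioned together adds one to that link.
        link_counts.update(combinations(unique, 2))
    nodes = [{"id": name, "count": count} for name, count in node_counts.items()]
    links = [
        {"source": a, "target": b, "count": count}
        for (a, b), count in link_counts.items()
        if count >= min_link_count
    ]
    return nodes, links


mentions = [  # made up for illustration
    ["Jon Snow", "Sansa Stark"],
    ["Jon Snow", "Sansa Stark", "Hodor"],
    ["Hodor"],
]
nodes, links = build_graph(mentions, min_link_count=2)
print(nodes)
print(links)
```

Raising `min_link_count` drops weak links, which is one simple lever against a hairball.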
  97. 97. Network Visualization Node-link diagram Force-directed layout http://blockbuilder.org/kristw/762b680690e4b2b2666dfec15838a384
  98. 98. Issue: Hairball
  99. 99. Issue: Occlusions
  100. 100. Tried: Fixed positions
  101. 101. + Collision Detection http://blockbuilder.org/kristw/2850f65d6329c5fef6d5c9118f1de6e6
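To show the idea behind collision detection for circles, here is a simplified sketch that checks every pair and pushes overlapping circles apart. It is not the D3 implementation linked above, which uses a quadtree and runs inside the force simulation; the `x`, `y`, `r` keys and the `padding` parameter are assumptions.

```python
import math


def resolve_collisions(circles, padding=1.0, iterations=50):
    """Push overlapping circles apart until no pair overlaps or iterations run out."""
    circles = [dict(c) for c in circles]  # work on copies
    for _ in range(iterations):
        moved = False
        for i in range(len(circles)):
            for j in range(i + 1, len(circles)):
                a, b = circles[i], circles[j]
                dx, dy = b["x"] - a["x"], b["y"] - a["y"]
                distance = math.hypot(dx, dy)
                min_distance = a["r"] + b["r"] + padding
                if distance >= min_distance:
                    continue
                if distance == 0:
                    ux, uy = 1.0, 0.0  # identical centers: pick an arbitrary direction
                else:
                    ux, uy = dx / distance, dy / distance
                # Move each circle half of the overlap along the line between centers.
                shift = (min_distance - distance) / 2
                a["x"] -= ux * shift
                a["y"] -= uy * shift
                b["x"] += ux * shift
                b["y"] += uy * shift
                moved = True
        if not moved:
            break
    return circles


nodes = [{"x": 0.0, "y": 0.0, "r": 10.0}, {"x": 5.0, "y": 0.0, "r": 10.0}]
print(resolve_collisions(nodes))
```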
  102. 102. + Community Detection https://github.com/upphiminn/jLouvain
  103. 103. + Collision Detection (with clusters) https://bl.ocks.org/mbostock/7881887
  104. 104. Tormund + Brienne
  105. 105. Let’s get other episodes.
  106. 106. More data Hadoop Rewrite the scripts in Scalding to get archived data
  107. 107. How much data do we need? Whole week? 5 days? 2 days? A day? etc.
  108. 108. How much data do we need?
  109. 109. Transitions
  110. 110. Changing episode
  111. 111. Community transition t=0 t=1
  112. 112. Smoother t=0 t=0.5 t=0.51 t=1
  113. 113. Colors Default: D3 category10. Distinct, but says nothing about the context. Custom palette: colors related to the groups/houses. Black = Night’s Watch, Blue = North, Red = Daenerys, Gold = Lannister, …
  114. 114. The vis is not enough.
  115. 115. Legend
  116. 116. Navigation
  117. 117. Top 3
  118. 118. Adjust threshold
  119. 119. Recap
  120. 120. Filtered Recap Tooltip
  121. 121. Demo https://interactive.twitter.com/game-of-thrones
  122. 122. Mobile Support
  123. 123. Self & Peer Does it solve the problem?
  124. 124. Google Analytics Pageviews Visitors Actions Referrals Sites/Social
  125. 125. Feedback
  126. 126. Feedback
  127. 127. ANALYTICS TOOLS
  128. 128. VISUAL ANALYTICS TOOL PROJECTS Richer, more features: to support exploration of complex data. More technical audience: product managers, engineers, data scientists. Accuracy. Designed for dynamic input. Long-term projects.
  129. 129. PROJECT LIFECYCLE Identify needs. Design and prototype: make it work for a sample dataset. Refine, generalize and productionize: make it work for other cases. Document and release. Maintain and support: keep it running, feature requests and bug fixes.
  130. 130. VISUAL ANALYTICS FOR LOG EVENTS [ PROJECT ]
  131. 131. USER ACTIVITY LOGS
  132. 132. [Diagram: Users use Twitter]
  133. 133. [Diagram: Users use Twitter; Product Managers are curious]
  134. 134. [Diagram: Users use Twitter; Engineers instrument Twitter and write log data in Hadoop; Product Managers are curious]
  135. 135. WHAT IS BEING LOGGED? Activities: tweet
  136. 136. WHAT IS BEING LOGGED? Activities: tweet from home timeline on twitter.com; tweet from search page on iPhone
  137. 137. WHAT IS BEING LOGGED? Activities: tweet from home timeline on twitter.com; tweet from search page on iPhone; sign up; log in; retweet; etc.
  138. 138. ORGANIZE?
  139. 139. LOG EVENT A.K.A. “CLIENT EVENT” [Lee et al. 2012]
  140. 140. LOG EVENT A.K.A. “CLIENT EVENT” Fields: 1) User ID 2) Timestamp 3) Event name 4) Event detail. Event name format: client : page : section : component : element : action, e.g. web : home : timeline : tweet_box : button : tweet [Lee et al. 2012]
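The six-level event name lends itself to a small parser. The sketch below assumes names are colon-separated exactly as shown on the slide; the `ClientEvent` type and `parse_event_name` function are hypothetical helpers, not part of any Twitter library.

```python
from typing import NamedTuple


class ClientEvent(NamedTuple):
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str


def parse_event_name(name: str) -> ClientEvent:
    """Split a six-level event name on ':' and validate the number of levels."""
    parts = [part.strip() for part in name.split(":")]
    if len(parts) != len(ClientEvent._fields):
        raise ValueError(f"expected 6 levels, got {len(parts)}: {name!r}")
    return ClientEvent(*parts)


event = parse_event_name("web:home:timeline:tweet_box:button:tweet")
print(event.page, event.action)  # home tweet
```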
  141. 141. LOG DATA
  142. 142. [Diagram: Users use Twitter; Engineers instrument Twitter and write log data in Hadoop; Product Managers are curious; the log data is bigger than Tweet data]
  143. 143. [Diagram: Users use Twitter; Engineers instrument Twitter and write log data in Hadoop; curious Product Managers ask Data Scientists]
  144. 144. [Diagram: Users use Twitter; Engineers instrument Twitter and write log data in Hadoop; curious Product Managers ask Data Scientists, who find the log data]
  145. 145. LOG DATA
  146. 146. [Diagram: Users use Twitter; Engineers instrument Twitter and write log data in Hadoop; curious Product Managers ask Data Scientists, who find and clean the log data]
  147. 147. [Diagram: Users use Twitter; Engineers instrument Twitter and write log data in Hadoop; curious Product Managers ask Data Scientists, who find and clean the log data; added label: Monitor]
  148. 148. [Diagram: Users use Twitter; Engineers instrument Twitter and write log data in Hadoop; curious Product Managers ask Data Scientists, who find, clean and analyze the log data; label: Monitor]
  149. 149. [Diagram: Users use Twitter; Engineers instrument Twitter and write log data in Hadoop; curious Product Managers ask Data Scientists, who find, clean and analyze the log data; label: Monitor; two points marked 1 and 2]
  150. 150. Event: client, page, section, component, element, action. 50,000+ event types.
  151. 151. Event: client, page, section, component, element, action. 50,000+ event types. One graph per event x 50,000.
  152. 152. DESIGN
  153. 153. CLIENT EVENT HIERARCHY
      iphone
        home
          -
            -
              -
                impression
            tweet
              tweet
                click
      iphone:home:-:-:-:impression
      iphone:home:-:tweet:tweet:click
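Building this hierarchy from flat event names can be sketched as a nested dictionary where each node sums the counts below it. The node layout (`name`, `count`, `children`) and the counts are made up for illustration.

```python
def build_hierarchy(event_counts):
    """Nest colon-separated event names into a tree, summing counts upward."""
    root = {"name": "root", "count": 0, "children": {}}
    for event_name, count in event_counts.items():
        node = root
        node["count"] += count
        for level in event_name.split(":"):
            children = node["children"]
            if level not in children:
                children[level] = {"name": level, "count": 0, "children": {}}
            node = children[level]
            node["count"] += count
    return root


counts = {  # made-up numbers for illustration
    "iphone:home:-:-:-:impression": 100,
    "iphone:home:-:tweet:tweet:click": 20,
}
tree = build_hierarchy(counts)
home = tree["children"]["iphone"]["children"]["home"]
print(home["count"])  # 120
print(sorted(home["children"]["-"]["children"]))  # ['-', 'tweet']
```

With counts rolled up at every level, one view can summarize thousands of event types instead of one graph per event.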
  154. 154. DETECT CHANGES [Figure: the client event hierarchy for TODAY compared to the same hierarchy from 7 DAYS AGO]
  155. 155. CALCULATE CHANGES [Figure: DIFF hierarchy with each node labeled by its change, e.g. +5%, +10%, -5%]
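The change calculation itself is simple arithmetic: (today - 7 days ago) / 7 days ago, expressed as a percentage. A sketch under that assumption follows; the function names and counts are made up, and the handling of events with no baseline is one possible choice.

```python
def percent_change(today, week_ago):
    """Return the change from week_ago to today as a percentage, or None if undefined."""
    if week_ago == 0:
        return None  # no baseline to compare against
    return (today - week_ago) * 100 / week_ago


def diff_events(today_counts, week_ago_counts):
    """Compare every event seen on either day."""
    names = sorted(set(today_counts) | set(week_ago_counts))
    return {
        name: percent_change(today_counts.get(name, 0), week_ago_counts.get(name, 0))
        for name in names
    }


today = {"iphone:home:-:-:-:impression": 105, "iphone:home:-:tweet:tweet:click": 90}
week_ago = {"iphone:home:-:-:-:impression": 100, "iphone:home:-:tweet:tweet:click": 100}
print(diff_events(today, week_ago))
```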
  156. 156. DISPLAY CHANGES [Figure: the iphone client event hierarchy] Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
  157. 157. DISPLAY CHANGES [Figure: the iphone client event hierarchy]
  158. 158. Demo / Scribe Radar
  159. 159. Twitter for Banana
  160. 160. 4. RESERVE TIME FOR REFINEMENT Details separate good and great work
  161. 161. “The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time.” — Tom Cargill, Bell Labs
  162. 162. REFINE & POLISH UX / UI + Mobile Support; Color; Animation / Transition; Metadata for SEO; Social media preview images; Performance (loading time, data file size). “The little of visualisation design” by Andy Kirk http://www.visualisingdata.com/2016/03/little-visualisation-design/
  163. 163. Example
  164. 164. Issue: Convex hull http://bl.ocks.org/mbostock/4341699
  165. 165. x & y only, no radius
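One possible fix, not necessarily the one used in the project, is to feed the hull sampled points on each circle's boundary instead of only its center, so the outline approximately wraps the full circles. The sketch below uses Andrew's monotone chain for the hull; all function names and the `samples` parameter are assumptions.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    points = sorted(set(points))
    if len(points) <= 2:
        return points

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def circle_outline(x, y, r, samples=16):
    """Approximate a circle's boundary with evenly spaced points."""
    return [
        (x + r * math.cos(2 * math.pi * k / samples),
         y + r * math.sin(2 * math.pi * k / samples))
        for k in range(samples)
    ]


def hull_around_circles(circles, samples=16):
    """Hull of sampled boundary points, so the outline accounts for each radius."""
    boundary = [p for (x, y, r) in circles for p in circle_outline(x, y, r, samples)]
    return convex_hull(boundary)


circles = [(0.0, 0.0, 10.0), (100.0, 0.0, 10.0)]  # (x, y, radius), made up
centers_only = convex_hull([(x, y) for x, y, _ in circles])
with_radius = hull_around_circles(circles)
print(max(x for x, _ in centers_only), max(x for x, _ in with_radius))
```

More samples per circle give a smoother outline at the cost of more hull points.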
  166. 166. Fix it
  167. 167. Fix it
  168. 168. Flatten the curve https://www.fastcompany.com/90476143/the-story-behind-flatten-the-curve-the-defining-chart-of-the-coronavirus
  169. 169. THE ORIGIN From a paper “Interim pre-pandemic planning guidance: community strategy for pandemic influenza mitigation in the United States: early, targeted, layered use of nonpharmaceutical interventions” published in 2007 by the CDC https://stacks.cdc.gov/view/cdc/11425
  170. 170. REVIVAL Rosamund Pearce, a data journalist at The Economist, rebuilt it for a piece about COVID-19. Changed the labeling scheme to assist colorblind readers. https://www.economist.com/briefing/2020/02/29/covid-19-is-now-in-50-countries-and-things-will-get-worse
  171. 171. THE LINE Drew Harris, an assistant professor at Thomas Jefferson University, came across the graphic in The Economist. He recalled using it a decade earlier as a pandemic preparedness trainer. So he added the dotted line “healthcare system capacity”. https://www.nytimes.com/article/flatten-curve-coronavirus.html
  172. 172. THEN IT WENT VIRAL
  173. 173. 5. PLAN FOR FEEDBACK or find ways to get some
  174. 174. “Feedback is the breakfast of champions.” — Ken Blanchard
  175. 175. FEEDBACK During development: feedback sessions with clients/potential users. After release: logging; user study; forum, user group; office hours.
  176. 176. 6. LOOK BACK FOR IMPROVEMENT
  177. 177. HOW TO BE BETTER? Retrospective: What could have been better? Wishlist. Expand skillset: learning opportunities. Get help: grow the team. Improve tooling: solve a problem once and for all, automate repetitive tasks.
  178. 178. REUSABLE WORK
  179. 179. LABELLA.JS [ PROJECT ]
  180. 180. GRID MAP [ PROJECT ]
  181. 181. COVID-19 Situation in Thailand by province
  182. 182. VX = REACT + D3 [ PROJECT ]
  183. 183. SUMMARY
  184. 184. 6 STEPS 1. Expect to find the real need 2. Expect to clean data a lot 3. Prepare to iterate 4. Reserve time for refinement 5. Plan for feedback 6. Look back for improvement Krist Wongsuphasawat / @kristw kristw.yellowpigz.com
  185. 185. ACKNOWLEDGEMENT My former and current colleagues at Twitter and Airbnb for their collaboration and support in these projects; and my wife for taking care of our two kids while I make these slides.
  186. 186. THANK YOU
  187. 187. QUESTIONS?
