Data Visualization: A Quick Tour for Data Science Enthusiasts

This talk was given at Chulalongkorn University in Bangkok, Thailand, on January 16, 2015, as part of CodeMania X2 "Data Science 101".

Published in: Data & Analytics

  1. Data visualization: A quick tour for data science enthusiasts. Krist Wongsuphasawat / @kristw
  2. Data visualization: What is it about? What is it good for? How is it related to data science? Example projects …
  3. 1. What is it about?
  4. “A picture is worth more than a thousand words.” (as someone once said)
  5. Data => Picture
  6. Data => Visual display
  7. Data => Visual display: help the audience consume a lot of information rapidly
  8. 2. What is it good for?
  9. Example / History
  10. data
  11. Encodings: location (lat, lon => x, y), quantity of troops (width), direction (color); time (x), temperature (y)
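The encoding on this slide, which appears to describe Minard's map of Napoleon's Russian campaign (the slide does not name it), can be sketched as a function from one data record to visual attributes. This is only an illustration: the record fields, canvas size, stroke limit, direction labels, and colors below are assumptions, not values from the talk.

```python
from dataclasses import dataclass

@dataclass
class TroopRecord:
    lat: float
    lon: float
    troops: int
    direction: str  # "advance" or "retreat" (hypothetical labels)

def linear_scale(value, domain, out_range):
    """Map value linearly from domain (d0, d1) to out_range (r0, r1)."""
    d0, d1 = domain
    r0, r1 = out_range
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)

def encode(record, lon_domain, lat_domain, max_troops,
           width_px=800, height_px=400, max_stroke_px=40):
    """Turn one record into position, stroke width, and color."""
    return {
        "x": linear_scale(record.lon, lon_domain, (0, width_px)),
        # Screen y grows downward, so the latitude range is flipped.
        "y": linear_scale(record.lat, lat_domain, (height_px, 0)),
        "stroke_width": record.troops / max_troops * max_stroke_px,
        "color": "tan" if record.direction == "advance" else "black",
    }

mark = encode(TroopRecord(lat=55.0, lon=30.0, troops=200000, direction="advance"),
              lon_domain=(20, 40), lat_domain=(50, 60), max_troops=400000)
```

Each data field drives exactly one visual channel, which is what lets a reader decode several variables from a single drawing.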
  12. Example / Cholera epidemic
  13. Data: a list of deceased patients. Mr. Smith, who lived at 11 Sunny St. Miss White, who lived at 23 Cloudy Rd. Mr. Jones, who lived at 30 Rainy St. Mrs. Robinson, who lived at 34 Windy Rd. …
  14. John Snow
  15. What is it good for? Storytelling: communicate known information. Exploratory data analysis: explore data to reveal insights.
  16. More powerful: Visualization = Visual display + Interaction
  17. 3. How is it related to data science?
  18. Turn data into valuable insights, data products, and interesting stories
  19. Workflow diagram: raw data, data wrangling, exploratory data analysis, in-depth analysis, report results, output (insights, products, stories)
  20. Workflow diagram: raw data, data wrangling, exploratory data analysis, in-depth analysis, report results, communication, storytelling, output (insights, products, stories)
  21. 4. Example projects
  22. 4.1 Ballon d’Or
  23. FIFA released voting data
  24. Rules: 3 voters per country (the national team captain, the national team coach, and a journalist from the media). Each voter selects 3 players, for 1st, 2nd and 3rd place.
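A minimal sketch of how ballots under these rules could be tallied. The slide gives no point weighting, so the 5/3/1 values are an assumption made for illustration, and the countries and players are placeholders.

```python
from collections import Counter

# Assumed points for 1st, 2nd and 3rd place; the slide does not state a weighting.
POINTS = (5, 3, 1)

def tally(ballots):
    """ballots: iterable of (country, role, [first, second, third])."""
    totals = Counter()
    for _country, _role, picks in ballots:
        for player, points in zip(picks, POINTS):
            totals[player] += points
    return totals

# Placeholder ballots: one country, three voters, three picks each.
ballots = [
    ("Country A", "captain", ["Player X", "Player Y", "Player Z"]),
    ("Country A", "coach", ["Player Y", "Player X", "Player Z"]),
    ("Country A", "media", ["Player X", "Player Z", "Player Y"]),
]
totals = tally(ballots)
```

Keeping the country and role on every ballot is what makes questions such as "who voted for whom" answerable later.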
  25. Workflow diagram (repeated): raw data, data wrangling, exploratory data analysis, in-depth analysis, report results, communication, storytelling, output (insights, products, stories)
  26. Data wrangling: the given data are tables in PDF. Extract them to CSV, then convert the data to the desired format.
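The last wrangling step could look like the sketch below, which reshapes an extracted CSV from one row per voter into one row per vote. The column names and sample rows are hypothetical, and the PDF extraction itself is not shown.

```python
import csv
import io

def wide_to_long(csv_text):
    """Turn one row per voter into one row per (voter, rank, player)."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        for rank, column in enumerate(("first", "second", "third"), start=1):
            rows.append({
                "country": record["country"].strip(),
                "role": record["role"].strip().lower(),
                "rank": rank,
                "player": record[column].strip(),
            })
    return rows

# Hypothetical extract; real column names depend on the source tables.
raw = """country,role,first,second,third
Country A,Captain,Player X,Player Y,Player Z
Country B,Coach,Player Y,Player X,Player Z
"""
long_rows = wide_to_long(raw)
```

A long format like this tends to be easier to filter, group, and feed to a visualization than the original wide table.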
  27. Demo / Ballon d’Or https://medium.com/@kristw/who-voted-for-who-diving-into-ballon-dor-voting-data-e09138ba9712
  28. 4.2 Public-facing vis & New year 2013
  29. interactive.twitter.com
  30. Geo: heatmap (color scale from low density to high density)
  31. Geo: San Francisco (color scale from low density to high density). flickr.com/photos/twitteroffice/8798020541
  32. Geo: San Francisco. Rebuild the world based on tweet volumes. twitter.github.io/interactive/andes/
  33. How are these phrases used in Tweets? Is there any pattern?
  34. Workflow diagram (repeated): raw data, data wrangling, exploratory data analysis, in-depth analysis, report results, communication, storytelling, output (insights, products, stories)
  35. Big data wrangling
  36. Having all Tweets: how people think I feel.
  37. Having all Tweets: how people think I feel vs. how I really feel.
  38. Challenges: there is too much data and only relevant Tweets are wanted, i.e. those that contain “สวัสดีปีใหม่” (“Happy New Year” in Thai), including variations (หวัดดีปีใหม่, หวัดดีปีหม่ายยย) and typos (หวัดตีปีใหม่). The data must be aggregated and reduced in size. Processing time is long (hours).
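One way to catch the spellings listed on this slide is a hand-written regular expression, sketched below. The pattern is an assumption built only from the four examples shown; it is not the filter used for the project, and real data would likely need more variants.

```python
import re

# Hand-written pattern for the spellings listed on the slide only:
# a greeting (standard, casual, or the typo), then "ปี", then "ใหม่"
# or a drawn-out "หม่าย..." with one or more trailing "ย".
NEW_YEAR = re.compile(r"(?:สวัสดี|หวัด[ดต]ี)ปี(?:ใหม่|หม่าย+)")

def is_new_year_tweet(text):
    """Return True if the text contains any covered spelling of the greeting."""
    return NEW_YEAR.search(text) is not None

samples = ["สวัสดีปีใหม่", "หวัดดีปีใหม่", "หวัดดีปีหม่ายยย", "หวัดตีปีใหม่", "good morning"]
matches = [is_new_year_tweet(text) for text in samples]
```

A pattern like this is cheap enough to apply while scanning every Tweet, which matters when the filter runs over the full dataset.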
  39. Workflow. Data storage: Hadoop cluster
  40. Workflow. Data storage: Hadoop cluster. Tool: Pig / Hive / Scalding (slow)
  41. Workflow. Data storage: Hadoop cluster. Tool: Pig / Hive / Scalding (slow)
  42. Workflow. Data storage: Hadoop cluster. Tool: Pig / Hive / Scalding (slow) => smaller dataset on your laptop
  43. Workflow. Data storage: Hadoop cluster. Tool: Pig / Hive / Scalding (slow) => smaller dataset on your laptop. Tool: node.js / python / etc. (fast) => final dataset
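The two-stage workflow can be imitated in plain Python to show the shape of the computation. This is a sketch, not Pig, Hive, or Scalding code: the first function stands in for the slow cluster job, the second for fast local analysis, and the timestamps and Tweet texts are made up.

```python
from collections import Counter

def minute_bucket(epoch_seconds):
    """Round a Unix timestamp down to the start of its minute."""
    return epoch_seconds - epoch_seconds % 60

def aggregate(tweets, keep):
    """Stage 1 (would run on the cluster): filter, then count per minute."""
    counts = Counter()
    for timestamp, text in tweets:
        if keep(text):
            counts[minute_bucket(timestamp)] += 1
    return counts

def peak(counts):
    """Stage 2 (runs locally on the small dataset): find the busiest minute."""
    return max(counts.items(), key=lambda item: item[1])

# Made-up (timestamp, text) pairs standing in for the raw data.
tweets = [
    (1356973199, "happy new year"),
    (1356973200, "happy new year!!"),
    (1356973230, "Happy New Year"),
    (1356973205, "lunch"),
]
counts = aggregate(tweets, keep=lambda text: "happy new year" in text.lower())
busiest = peak(counts)
```

Only the per-minute counts leave the first stage, so the second stage can be rerun many times on a laptop while exploring.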
  44. Exploratory Data Analysis
  45. Improve design for releasing to public
  46. Demo / New Year 2013 twitter.github.io/interactive/newyear2014/
  47. Another fun fact: developed using 2012 data, then updated with new data on Jan 2, 2013.
  48. 4.3 Data Analysis Tool
  49. Workflow diagram (repeated): raw data, data wrangling, exploratory data analysis, in-depth analysis, report results, communication, storytelling, output (insights, products, stories)
  50. Logging user activities
  51. Users => use => Twitter
  52. Users => use => Twitter. Product managers: curious
  53. Users => use => Twitter. Product managers: curious. Engineers => instrument => Twitter. Twitter => write => log data in Hadoop
  54. What is being logged? Activities: tweet
  55. What is being logged? Activities: tweet from home timeline on twitter.com; tweet from search page on iPhone
  56. What is being logged? Activities: tweet from home timeline on twitter.com; tweet from search page on iPhone; sign up; log in; retweet; etc.
  57. Organize?
  58. log event a.k.a. “client event” [Lee et al. 2012]
  59. Log event, a.k.a. “client event” [Lee et al. 2012]. Contents: 1) User ID 2) Timestamp 3) Event name 4) Event detail. Event name format: client : page : section : component : element : action, e.g. web : home : timeline : tweet_box : button : tweet
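The six-level event name and the four fields of a client event map naturally onto small record types. The sketch below assumes Python types and a Unix-seconds timestamp for illustration; the slide does not specify how the fields are stored.

```python
from typing import NamedTuple

class EventName(NamedTuple):
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str

def parse_event_name(name):
    """Split 'client : page : section : component : element : action'."""
    parts = [part.strip() for part in name.split(":")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 levels, got {len(parts)}: {name!r}")
    return EventName(*parts)

class ClientEvent(NamedTuple):
    user_id: int
    timestamp: int   # Unix seconds (assumed unit)
    name: EventName
    detail: dict     # free-form event detail (assumed shape)

event = ClientEvent(
    user_id=1,
    timestamp=0,
    name=parse_event_name("web : home : timeline : tweet_box : button : tweet"),
    detail={},
)
```

With a fixed number of levels, events can be grouped at any depth, for example everything on the `home` page or every `tweet` action regardless of client.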
  60. Twitter for Banana
  61. Count page visits. Home page: banana : home : - : - : - : impression
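Counting visits then reduces to counting log entries whose event name equals the page's impression event. A minimal sketch, using a made-up in-memory list of event names rather than real log data:

```python
def normalize(name):
    """Collapse spacing so 'a : b' and 'a:b' compare equal."""
    return ":".join(part.strip() for part in name.split(":"))

def count_events(event_names, target):
    """Count entries whose normalized name equals the normalized target."""
    target = normalize(target)
    return sum(1 for name in event_names if normalize(name) == target)

# Made-up event names in the format shown on the slide.
log = [
    "banana : home : - : - : - : impression",
    "banana : profile : - : - : - : impression",
    "banana:home:-:-:-:impression",
    "banana : search : - : - : - : impression",
]
home_visits = count_events(log, "banana : home : - : - : - : impression")
```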
  62. User sessions, where each step (such as A) is a client event. Session #1: start => A => B => end. Session #2: start => A => B => end. Session #3: start => A => C => end. Session #4: start => A => end.
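A common way to cut a stream of client events into sessions is to start a new session after a period of inactivity. The sketch below does that; the 30-minute gap is an assumption, since the slide does not say how sessions are delimited, and the events are made up.

```python
from collections import defaultdict

def sessionize(events, gap_seconds=30 * 60):
    """events: iterable of (user_id, timestamp, event_name).

    Returns a list of (user_id, [event_name, ...]) sessions, splitting a
    user's events wherever two consecutive events are more than
    gap_seconds apart.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, name in events:
        by_user[user_id].append((timestamp, name))

    sessions = []
    for user_id, user_events in by_user.items():
        user_events.sort()  # order each user's events by time
        current = []
        last_seen = None
        for timestamp, name in user_events:
            if last_seen is not None and timestamp - last_seen > gap_seconds:
                sessions.append((user_id, current))
                current = []
            current.append(name)
            last_seen = timestamp
        sessions.append((user_id, current))
    return sessions

# Made-up events: user 1 returns after more than 30 minutes.
events = [(1, 0, "A"), (1, 60, "B"), (2, 10, "A"), (1, 4000, "A"), (1, 4100, "C")]
sessions = sessionize(events)
```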
  63. Funnel: home page => profile page
  64. Funnel analysis: home page (banana : home : - : - : - : impression) => profile page (banana : profile : - : - : - : impression). 1 job, 1 hour
  65. Funnel analysis: home page (banana : home : - : - : - : impression) => profile page (banana : profile : - : - : - : impression) => search page (banana : search : - : - : - : impression). 2 jobs, 2 hours
  66. Funnel analysis: home page (banana : home : - : - : - : impression) => profile page (banana : profile : - : - : - : impression) => search page (banana : search : - : - : - : impression). Specify all funnels manually! n jobs, n hours
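A single manually specified funnel can be computed as in the sketch below, which counts how many sessions reach each step in order. It treats a funnel as an ordered subsequence, so other pages may occur between steps; that interpretation and the sample sessions are assumptions.

```python
def funnel_counts(sessions, steps):
    """Return how many sessions reached step 1, steps 1-2, and so on, in order."""
    counts = [0] * len(steps)
    for session in sessions:
        position = 0  # index of the next funnel step this session must hit
        for page in session:
            if position < len(steps) and page == steps[position]:
                counts[position] += 1
                position += 1
    return counts

# Made-up sessions, each a list of visited pages.
sessions = [
    ["home", "profile", "search"],
    ["home", "search"],
    ["profile", "home", "profile"],
    ["search"],
]
reached = funnel_counts(sessions, ["home", "profile", "search"])
```

Every new funnel needs another pass like this one over the sessions, which is the cost the slide points at with "n jobs, n hours".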
  67. Goal: 1 job => all funnels, visualized. Home page (banana : home : - : - : - : impression) … ……
  68. User sessions. Session #1: start => A => B => end. Session #2: start => A => B => end. Session #3: start => A => C => end. Session #4: start => A => end.
  69-76. Aggregate: animation frames that merge the 4 sessions, step by step, into one diagram (start, then A, then B, C, or end)
  77. Aggregate: the same diagram for 4,000,000 sessions
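One way to aggregate sessions like this is a prefix tree in which each node counts the sessions passing through it. The sketch below is an assumption about the general technique, not the implementation behind the demo, and the four sessions are an illustrative reading of the slides.

```python
def build_tree(sessions):
    """Merge event sequences into a prefix tree with per-node session counts."""
    root = {"count": 0, "children": {}}
    for session in sessions:
        node = root  # the root plays the role of "start"
        node["count"] += 1
        # Append an explicit "end" marker so drop-off points are visible.
        for event in list(session) + ["end"]:
            node = node["children"].setdefault(
                event, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root

# Illustrative sessions: all start with A, then diverge.
sessions = [["A", "B"], ["A", "B"], ["A", "C"], ["A"]]
tree = build_tree(sessions)
a_node = tree["children"]["A"]
```

Sessions that share a prefix share nodes, so the tree grows with the number of distinct paths rather than with the number of sessions, and every funnel can be read from one structure.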
  78. Demo / Flying Sessions. “Using Visualizations to Monitor Changes and Harvest Insights from a Global-Scale Logging Infrastructure at Twitter” by Krist Wongsuphasawat and Jimmy Lin, in Proc. IEEE Conference on Visual Analytics Science and Technology (VAST), Paris, France, 13 November 2014
  79. Data visualization. What is it about? Data => Visual display + Interaction. What is it good for? Exploratory data analysis & storytelling. How is it related to data science? It is one of the skills often utilized in the process. Example projects: interactive.twitter.com. @kristw / kristw.yellowpigz.com
  80. Thank you
  81. Questions?
