Visualize data using the split-apply-combine approach


An easy-to-understand primer on the "split-apply-combine" concept popularized by Hadley Wickham, applied to data visualization. After that, I give a simple introduction to the perceptual variables available for data visualization and to some common mistakes.

Published in: Design

  1. 1. A Methodic Approach to Good Data Visualization Luca Candela - @luckymethod
  2. 2. Luca Candela DataPad Inc. // UX Eye // @luckymethod
  3. 3. “Men of great rank, or active business, can only pay attention to particulars of use […] it is hoped that with the assistance of these Charts, information will be got, without the fatigue and trouble of studying the particulars […]” William Playfair - Commercial and Political Atlas, 1786
  4. 4. Data visualization is the art of reducing information* in a data set while preserving the knowledge contained in it. (*We can talk about what “reducing information” means in this case...)
  5. 5. Conceptual data analysis workflow: Data Preparation, Data Visualization, Discovery of knowledge
  6. 6. Hadley Wickham popularized a concept called split-apply-combine as a way of thinking about data querying. http://www.jstatsoft.org/v40/i01/paper
  7. 7. For the four most revenue generating countries, what are the top three most revenue generating categories? Result (Country: Venue Type, Sum Revenue): United States: Fast Food $16, Street $10, Restaurant $9; France: Cafe $18, Pub $12, Restaurant $2; Canada: Cafe $10, Fast Food $4, Street $3; Japan: Street $5, Fast Food $4, Pub $1
  8. 8. The basics of split-apply-combine. Split: the data by country (Canada, United States, Germany, France, Japan). Apply: Sum Revenue (Canada = $36, United States = $83, Germany = $8, France = $42, Japan = $18). Combine: sort descending by Sum Revenue, limit 4 (United States $83, France $42, Canada $36, Japan $18)
  9. 9. The basics of split-apply-combine. Split: the data by country (Canada, United States, Germany, France, Japan), then each country by venue type (bus stop, fastfood, park, restaurant, hair salon, pub, street, cafe, ...). Result (Country: Venue type, Sum Revenue): United States: fastfood $16, street $10, restaurant $9; France: cafe $18, pub $12, restaurant $2; Canada: cafe $10, fastfood $4, park $3; Japan: street $5, fastfood $4, pub $1
  10. 10. Split-apply-combine thinking translates to visualizations. Split: by country. Apply: sum revenue, call it Sum Revenue, plot rectangles and map length to the horizontal axis using a linear scale; color with #45808E; use `Country` as label. Combine: sort descending on Sum Revenue, map to the vertical axis using an ordinal scale, add labels. [Chart: horizontal bars of Sum Revenue for United States, France, Canada, Japan]
  11. 11. Nested split-apply-combine underpins more complex visualizations. (1) Split on state; apply: sum population; combine: sort descending by population, limit 6. (2) Split on age (binned by 5 years); apply: sum population; combine: sort by age.
  12. 12. Data visualization can be thought of as a visual mapping function applied during the Apply and Combine* steps. (*Although it can also be thought of as applied exclusively during the Combine step…)
  13. 13. Reduce information, preserve knowledge... Raw rows (Name, Operation, Lines): Vadim Added 100; Luca Removed 34; Vadim Added 65; Vadim Removed 5; Luca Added 24; Vadim Removed 71; Luca Removed 45; Vadim Added 7; ... After “plot”: [Chart: additions and deletions for Luca and Vadim on an axis from -2k to 2k; bar labels 1531, -960, 739, -321]
  14. 14. Question: Mapping of what, to what?
  15. 15. Types of data. Columns: ID, Timestamp, Location, Name, Operation, Lines, Pass Test?
      0000001, 11-05-2013 10.45 am, San Francisco, Vadim, Added, 100, Yes
      0000002, 11-05-2013 11.12 am, San Bruno, Luca, Removed, 34, Yes
      0000003, 11-05-2013 11.30 am, San Francisco, Vadim, Added, 65, Yes
      0000004, 11-05-2013 11.34 am, San Francisco, Vadim, Removed, 5, Yes
      0000005, 11-05-2013 11.43 am, San Bruno, Luca, Added, 24, No
      0000006, 11-05-2013 11.45 am, San Francisco, Vadim, Removed, 71, Yes
      0000007, 11-05-2013 12.51 pm, San Francisco, Luca, Removed, 45, Yes
      0000008, 11-05-2013 12.55 pm, San Francisco, Vadim, Added, 7, No
      ...
      Type labels shown on the slide: Categorical, # Discrete, # Continuous, # Discrete, Boolean
  16. 16. There are other ways to classify data, but this one will get you very far. Pick up a good statistics book and just start reading...
  17. 17. Types of variables. Independent: a variable that isn't changed by the other variables you are trying to measure; it usually goes on the x axis. Dependent: a variable that changes depending on other variable(s); it usually goes on the y axis.
  18. 18. [Chart: additions and deletions for Luca and Vadim on an axis from -2k to 2k (bar labels 1531, -960, 739, -321), annotated with “Dependent Variable” and “Independent Variable”]
  19. 19. Variables of a visualization: (1) Position (x, y); (2) Size (big, small…); (3) Value (bright, dark…); (4) Texture (hatched, dotted…); (5) Color (blue, red…); (6) Orientation (degree); (7) Shape (triangle, circle…)
  20. 20. Optimal mappings by type. [Figure: one x-y panel each for # Discrete, # Continuous, Categorical, and Boolean data]
  21. 21. Raw rows (Name, Operation, Lines): Vadim Added 100; Luca Removed 34; Vadim Added 65; Vadim Removed 5; Luca Added 24; Vadim Removed 71; Luca Removed 45; Vadim Added 7; ... Split on Name, then split on Operation. Apply: Sum(Added) and Sum(Removed). Combine: Name maps to the x axis; Added maps to green, value to size; -Removed maps to red, value to size. [Chart: added and removed lines for Luca and Vadim on an axis from -2k to 2k; bar labels 1531, -960, 739, -321]
  22. 22. Apply the minimum number of mappings that illustrates the underlying question you are trying to answer.
  23. 23. Choosing the right viz...
  24. 24. Golden rules - Part 1: (1) Label your axes. (2) Include measurement units. (3) Explain your encodings (add a legend). (4) Remove redundant information. (5) Don’t distort the axis, especially with time series.
  25. 25. Golden rules - Part 2: (1) If you are trying to visualize rate of change, then plot the rate of change directly. (2) Remove outliers, but know they are there. (3) Tools have their own biases and quirks; know them. (4) The solution to 80% of your problems is bar charts and histograms. (5) Data tables are visualizations too. ...There are thousands of good rules, but the best one is still “keep it simple”.
  26. 26. Some examples. This is going to be fun...
  27. 27. Example 1: simple bar chart, linear scale, missing bucket (4.8 - 4.9)
  28. 28. Example 2
  29. 29. Example 2 - better (labels: No - Human; Yes - Robot)
  30. 30. Example 3
  31. 31. Example 4
  32. 32. Example 5. OK, this is comically bad; I was just going for a good collective giggle...
  33. 33. Books you should read. Everybody knows about Tufte, so please don’t bring him up.
  34. 34. Semiology of Graphics, 1967, Jacques Bertin
  35. 35. The Elements of Graphing Data, 1985, and Visualizing Data, 1993, William S. Cleveland
  36. 36. www.datapad.io
  37. 37. Thank you! For questions, tweet me at @luckymethod
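
The nested question from slides 7 to 9 (top three venue types within the four highest-revenue countries) can be sketched in plain Python. This is only an illustration: the deck shows aggregates, not raw data, so the `ROWS` below are hypothetical, chosen so that the per-venue sums match slide 7 (the country totals therefore differ from slide 8). The helper names `sum_by`, `top`, and `top_categories_of_top_countries` are also made up for this sketch.

```python
from collections import defaultdict

# Hypothetical raw rows (country, venue_type, revenue). The slides only show
# aggregated results, so these values are invented for illustration.
ROWS = [
    ("United States", "Fast Food", 10), ("United States", "Fast Food", 6),
    ("United States", "Street", 10), ("United States", "Restaurant", 9),
    ("United States", "Park", 1),
    ("France", "Cafe", 18), ("France", "Pub", 12), ("France", "Restaurant", 2),
    ("Canada", "Cafe", 10), ("Canada", "Fast Food", 4), ("Canada", "Street", 3),
    ("Canada", "Pub", 1),
    ("Japan", "Street", 5), ("Japan", "Fast Food", 4), ("Japan", "Pub", 1),
    ("Germany", "Pub", 5), ("Germany", "Cafe", 3),
]


def sum_by(rows, key_index, value_index=2):
    """Split rows on one column and apply a sum to another column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


def top(totals, limit):
    """Combine: sort groups descending by total and keep the first `limit`."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


def top_categories_of_top_countries(rows, n_countries=4, n_categories=3):
    """Outer split-apply-combine on country, inner one on venue type."""
    result = {}
    for country, _total in top(sum_by(rows, 0), n_countries):
        country_rows = [r for r in rows if r[0] == country]
        result[country] = top(sum_by(country_rows, 1), n_categories)
    return result


if __name__ == "__main__":
    for country, venues in top_categories_of_top_countries(ROWS).items():
        print(country, venues)
```

The outer call picks the countries and the inner call reuses the same two helpers on each country's rows, which is the "nested" structure slide 11 refers to.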

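
Slides 12 and 21 describe visualization as a mapping applied after the split and apply steps. A minimal sketch of that idea follows, using only the eight rows visible on the slides (the deck elides the rest with "...", so the sums here are not the chart's 1531, -960, 739, -321). The `ENCODING` table, the mark dictionaries, and the text rendering are assumptions made for this example, not something the deck specifies.

```python
from collections import defaultdict

# The eight rows visible on slides 13 and 21; the remaining rows are elided there.
ROWS = [
    ("Vadim", "Added", 100), ("Luca", "Removed", 34), ("Vadim", "Added", 65),
    ("Vadim", "Removed", 5), ("Luca", "Added", 24), ("Vadim", "Removed", 71),
    ("Luca", "Removed", 45), ("Vadim", "Added", 7),
]

# Assumed encoding: operation -> (sign on the y axis, color), following
# "Added map to Green" and "-Removed map to Red" on slide 21.
ENCODING = {"Added": (+1, "green"), "Removed": (-1, "red")}


def split_apply(rows):
    """Split on (Name, Operation) and apply a sum over Lines."""
    totals = defaultdict(int)
    for name, operation, lines in rows:
        totals[(name, operation)] += lines
    return dict(totals)


def combine_to_marks(totals):
    """Combine: map each group to a bar (x = name, height = signed sum, color)."""
    marks = []
    for (name, operation), total in sorted(totals.items()):
        sign, color = ENCODING[operation]
        marks.append({"x": name, "height": sign * total, "color": color})
    return marks


def render(marks, lines_per_char=10):
    """Crude text rendering: bar length is proportional to |height|."""
    out = []
    for mark in marks:
        bar = "#" * round(abs(mark["height"]) / lines_per_char)
        out.append(f'{mark["x"]:<6}{mark["height"]:>5} {mark["color"]:<6}{bar}')
    return "\n".join(out)


if __name__ == "__main__":
    print(render(combine_to_marks(split_apply(ROWS))))
```

Only three mappings are used (name to position, operation to color, summed lines to size), in the spirit of slide 22's advice to apply the minimum number of mappings that answers the question.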