Lies, damned lies & dataviz

Bad visualization, and how to avoid it


1. 1. Lies, Damned Lies & Dataviz: Bad visualization, and how to avoid it. Dr. Andrew Clegg, Director, Learner Analytics & Data Science, Pearson. @andrew_clegg
2. 2. Introduction. Part I: Why Visualize? What are the benefits when it’s done right? Part II: Bad Dataviz. How to spot the failures, and how to avoid them yourself. Warning: Contains Opinion!
3. 3. Part I — Why Visualize?
4. 4. What is the goal? ● Summarizing and communicating numbers ● Drawing attention to trends and patterns ● Exploring data interactively ● Capturing attention ● Telling stories
5. 5. How does visualization help? Playing to your neural hardware’s strengths: your visual system excels at pattern detection and parallel processing. Representing data graphically means you can leverage this “for free”.
6. 6. Challenge: estimate x when y = 0

    x      y      x      y      x      y
27.38  24.05  32.31  31.61  75.67  14.83
62.64   7.31  51.84  28.61  34.23  31.65
50.76  16.30  59.04  18.29  51.21   7.69
42.94  26.78  74.63   1.15  47.26  22.90
 8.72  42.35  56.15  11.37  66.60   3.21
30.62  30.87  47.23  19.49  17.46  40.31
62.63   9.14  59.36   8.82  65.70  12.79
63.21  18.66  44.58  19.12  52.24  12.92
40.49  23.29  47.85  20.55  62.56  14.17
22.07  41.46  68.21  11.99  40.43  19.77

7. 7. Challenge: estimate x when y = 0
8. 8. Challenge: estimate x when y = 0
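
The challenge above can also be answered numerically. The following is a minimal sketch, not the method used in the talk: it assumes the numbers on slide 6 are read as consecutive (x, y) pairs, fits a straight line by ordinary least squares, and solves for the x at which the fitted line reaches y = 0. The names `points`, `fit_line` and `x_at_zero` are illustrative.

```python
# Data from the slide-6 table, read as consecutive (x, y) pairs (an assumption
# about how the flattened table was laid out).
points = [
    (27.38, 24.05), (32.31, 31.61), (75.67, 14.83),
    (62.64, 7.31), (51.84, 28.61), (34.23, 31.65),
    (50.76, 16.30), (59.04, 18.29), (51.21, 7.69),
    (42.94, 26.78), (74.63, 1.15), (47.26, 22.90),
    (8.72, 42.35), (56.15, 11.37), (66.60, 3.21),
    (30.62, 30.87), (47.23, 19.49), (17.46, 40.31),
    (62.63, 9.14), (59.36, 8.82), (65.70, 12.79),
    (63.21, 18.66), (44.58, 19.12), (52.24, 12.92),
    (40.49, 23.29), (47.85, 20.55), (62.56, 14.17),
    (22.07, 41.46), (68.21, 11.99), (40.43, 19.77),
]


def fit_line(pts):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(points)
x_at_zero = -intercept / slope  # solve 0 = slope * x + intercept
print(f"slope={slope:.3f} intercept={intercept:.2f} x at y=0 is about {x_at_zero:.1f}")
```

The arithmetic is simple, but it only becomes obvious that a straight line is a reasonable model once the points are plotted, which is the point of the slides.
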
9. 9. Challenge: find most similar data point

    x      y      x      y      x      y
54.88  71.52  97.86  79.92  35.95  43.70
60.28  54.49  46.15  78.05  69.76   6.02
42.37  64.59  11.83  63.99  66.68  67.06
43.76  89.18  14.34  94.47  21.04  12.89
96.37  38.34  52.18  41.47  31.54  36.37
79.17  52.89  26.46  77.42  57.02  43.86
56.80  92.56  45.62  56.84  98.84  10.20
 7.10   8.71   1.88  61.76  20.89  16.13
 2.02  83.26  61.21  61.69  65.31  25.33
77.82  87.00  94.37  68.18  46.63  24.44

10. 10. Challenge: find most similar data point
11. 11. Challenge: find the outlier (same x, y data table as slide 9)
12. 12. Challenge: find the outlier
13. 13. How does visualization help? Avoiding limitations of statistics: showing patterns in large data sets with minimal information loss, and revealing the structure of “tricky” data sets where typical summary statistics do a poor job.
15. 15. Describing statistically tricky data. Anscombe’s Quartet (Francis Anscombe, 1973): http://www.stanford.edu/~mwaskom/software/seaborn/examples/anscombes_quartet.html All four data sets have nearly identical mean(x), variance(x), mean(y), variance(y), correlation coefficient and regression coefficients, yet they look completely different when plotted.
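
The matching summary statistics of Anscombe’s Quartet can be checked directly. This is a small sketch using only the Python standard library; the data values are the ones commonly reproduced from Anscombe’s 1973 paper, and `pearson_r` and `summary` are illustrative names.

```python
from statistics import mean, variance

# Anscombe's quartet, values as commonly reproduced from Anscombe (1973).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


summary = {}
for name, (xs, ys) in quartet.items():
    # (mean x, sample variance x, mean y, sample variance y, correlation)
    summary[name] = (mean(xs), variance(xs), mean(ys), variance(ys), pearson_r(xs, ys))
    print(name, [round(v, 2) for v in summary[name]])
```

The four rows agree to about two decimal places, so none of these statistics can distinguish the data sets; only a plot does.
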
16. 16. Describing statistically tricky data. Zipfian distribution, an example of a power law. Much web data, especially involving human preferences or choices, looks like this. There is no meaningful “central tendency”, so typical descriptive statistics such as the mean are of little use.
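
To see why the mean says so little about Zipf-like data, here is a sketch with an idealised, made-up distribution in which the item at rank r receives a count proportional to 1/r. The 1,000 items and the scale of 1,000,000 are arbitrary assumptions for illustration, not figures from the talk.

```python
from statistics import mean, median

# Hypothetical Zipfian popularity: count at rank r is proportional to 1 / r.
N_ITEMS = 1000
counts = [1_000_000 / rank for rank in range(1, N_ITEMS + 1)]

avg = mean(counts)
mid = median(counts)
top10_share = sum(counts[:10]) / sum(counts)         # share held by the 10 most popular items
below_avg = sum(1 for c in counts if c < avg) / N_ITEMS  # fraction of items below the mean

print(f"mean={avg:.0f} median={mid:.0f}")
print(f"top 10 items hold {top10_share:.0%} of the total")
print(f"{below_avg:.0%} of items fall below the mean")
```

The mean sits several times above the median and most items fall below it, so it describes neither the popular head nor the long tail.
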
17. 17. How does visualization help? Illustrating a story Visualizations are often used simply to clarify or reinforce the main points of a story, narrative or message. This process fails when the conclusions suggested by the graphic are irrelevant to the narrative, or even contradict it. It can also fail when the graphic has no clear message or multiple conflicting interpretations, or is largely incomprehensible. Many of the following examples illustrate these mistakes.
18. 18. Part II — Bad Dataviz
19. 19. Bad dataviz 1. Axes of evil http://fluffware.tumblr.com/post/4580822773/axes
20. 20. Unlabelled axes http://www.google.co.uk/trends/explore#q=%22data science%22
21. 21. Axis scale manipulation. Firearms (skjutvapen) seizures report: http://bit.ly/1dHnFzC (PDF) via Junk Charts
22. 22. Axis scale manipulation (totally shameless version). Version published by Reuters; version “fixed” by @jk_keller. https://twitter.com/jk_keller/status/410498080765919232/photo/1
23. 23. Dual axes: caution. Example from Stephen Few (PDF). Natural interpretation: units sold “dipped below” revenue (point A) and are now “catching up” (point B). But these impressions are meaningless: they are just artefacts of the chosen axis scales.
24. 24. Proportionality errors. From an Australian document found at The Guardian: 1 row of people = roughly 43,000 nurses; 10 rows = roughly 48,000 nurses. ?!?
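
One way to put a number on this kind of distortion is Edward Tufte’s “lie factor”: the relative change shown in the graphic divided by the relative change in the data. The slides do not mention the lie factor, so treat this as an outside illustration; the function name and arguments are hypothetical, and the inputs are the row counts and nurse numbers quoted on the slide.

```python
def lie_factor(shown_before, shown_after, data_before, data_after):
    """Relative change shown in the graphic divided by relative change in the data."""
    shown_effect = (shown_after - shown_before) / shown_before
    data_effect = (data_after - data_before) / data_before
    return shown_effect / data_effect


# 1 row of figures stands for about 43,000 nurses, 10 rows for about 48,000.
factor = lie_factor(shown_before=1, shown_after=10,
                    data_before=43_000, data_after=48_000)
print(f"lie factor is roughly {factor:.0f}")
```

A faithful graphic has a lie factor close to 1; here the picture grows by 900% while the data grows by under 12%.
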
25. 25. Cheating outright? All found via The Guardian
26. 26. Quick quiz: what happened in 2005?
27. 27. Axis inversion: when “down” means “up”?!? From Thomson Reuters via Business Insider. Version published by Reuters; version “fixed” by @PFedewa.
29. 29. Pie charts: avoid. Three example charts are labelled Bad, Worse and Worst. Colours are used up separating the slices, so they can’t easily be put to another use. There is no way to show a time dimension statically. Comparing relative sizes of slices is hard. Doing it in 3D is harder: perspective inflates the nearer slices, and the similar volume of the objects is a red herring. Doing it with deep, discontinuous 3D objects is even harder.
30. 30. Pie charts: avoid. Perhaps justifiable (in 2D) if the numbers are sufficiently different. Otherwise, use a much simpler design and avoid all those problems.
31. 31. Pie chart horrors http://junkcharts.typepad.com/junk_charts/2014/03/two-charts-that-fail-self-sufficiency.html Pie charts are supposed to show proportions of a whole. People expect the %s to add up to 100%. This one shows proportions of separate quantities.
32. 32. Pie chart horrors From a World Bank report (PDF) found at The Guardian These ones show 96% and 40% as full circles. This one is falling apart. This one thinks 76% is less than three quarters.
33. 33. Even worse uses of 3D https://www.tableausoftware.com/public/blog/2011/01/viz-wiz-1-11 and http://www.simplexnumerica.com/Gallery/gallery_pyramid.html Cones, pyramids, spheres etc… Are we comparing width, height, area or volume? Nobody knows! 26.76% = tiny peak 23.32% = massive slab ?!?
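
The width/height/area/volume ambiguity is easy to quantify. The sketch below assumes a designer scales every linear dimension of a shape in proportion to the value, which is a common mistake but not necessarily what the charts on this slide did; both function names are illustrative.

```python
import math


def apparent_ratios(value_ratio):
    """Length, area and volume ratios when every linear dimension is scaled by value_ratio."""
    return value_ratio, value_ratio ** 2, value_ratio ** 3


def radius_for_area(value, reference_value=1.0, reference_radius=1.0):
    """Radius that makes a circle's area, rather than its radius, proportional to value."""
    return reference_radius * math.sqrt(value / reference_value)


length, area, volume = apparent_ratios(2)
print(f"value x2 -> length x{length}, area x{area}, volume x{volume}")
print(f"radius for a value 4 times the reference: {radius_for_area(4):.1f}")
```

A value that merely doubles can look four or eight times larger depending on which property the reader compares, which is why the encoding has to be unambiguous.
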
34. 34. Stacked charts: caution Stacked charts show how a data series breaks down by another attribute of the data. But people often misread these as two distinct data series, reading off a separate y-axis value for each one.
35. 35. Bubble charts: avoid http://commons.wikimedia.org/wiki/File:Bubble_Chart_Chicago_Deposit_Market.jpg
36. 36. Bubble charts: avoid http://bit.ly/1okS3nE and http://bit.ly/1hdZQtO
38. 38. Non-normalized quantities are useless http://personal.frostburg.edu/jibandy0/starbucks%20map.jpg Don’t use absolute values without a very good reason. Normalize appropriately: per capita, per adult, per student, per household, per square km, per journey, per voter …
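
As a sketch of why normalization matters, the example below ranks three regions first by raw count and then per 100,000 people. The regions and all numbers are invented purely for illustration; they are not taken from the map on the slide.

```python
# Hypothetical data: region -> (number of outlets, population). Invented numbers.
regions = {
    "Region A": (500, 10_000_000),
    "Region B": (120, 1_000_000),
    "Region C": (40, 250_000),
}


def per_100k(count, population):
    """Count per 100,000 inhabitants."""
    return count / population * 100_000


by_count = sorted(regions, key=lambda r: regions[r][0], reverse=True)
by_rate = sorted(regions, key=lambda r: per_100k(*regions[r]), reverse=True)

print("ranked by raw count:", by_count)
print("ranked per 100k:    ", by_rate)
```

With these made-up figures the ranking reverses completely once population is taken into account, so a map of raw counts would mostly be a map of where people live.
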
39. 39. Remember: geopolitical boundaries are artificial This map shows all the countries I’ve visited. The relative size of USA makes me seem much more widely travelled than I really am. Is “country” the right level of aggregation?
40. 40. Remember: map projections lie http://en.wikipedia.org/wiki/File:Hobo%E2%80%93Dyer_projection_SW.jpg
41. 41. http://en.wikipedia.org/wiki/File:Choropleth-density.png
42. 42. Consider using fixed-size bins http://bit.ly/O9EPta
43. 43. Drawbacks of maps ● Can’t easily show time dimension, without animation ● Hard to show multiple attributes of data at once ● Physical proximity can obscure demographic/cultural differences, and vice versa Just because you can map the data, doesn’t mean you should. Save maps for when geographical trends are the key focus.
44. 44. Bad dataviz 4. Colour choice. Good colour palettes from RColorBrewer
45. 45. Sequential data http://mapsdeguap.blogspot.co.uk/2012/04/choropleth-map.html Use a smooth transition from min to max. Don’t “cycle” more than once: this map goes purple-green twice. A better choice would be a palette that makes a single smooth transition.
46. 46. Diverging data: sequential data, but with a well-defined midpoint. There are two directions from this midpoint, giving two poles: above/below average, positive/negative, female/male, Democrat/Republican etc. http://www-03.ibm.com/press/us/en/pressrelease/35359.wss Here the yellow section indicates the median, and red/green = above/below median. However, the red and green ranges are not scaled well: 75 (close to median) is almost the same colour as 108 (max).
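
One way to avoid the scaling problem described here is to map each value to a position between -1 and +1 around the midpoint, scaling each side independently, and only then pick a colour. This is a minimal sketch: the function name is illustrative, and the minimum of 40 and median of 70 are hypothetical stand-ins, since the slide only gives the values 75 and 108.

```python
def diverging_position(value, lo, mid, hi):
    """Map value to [-1, 1] with mid at 0; assumes lo < mid < hi.

    Each side of the midpoint is scaled independently, so the extremes
    always land on -1 and +1 even when the data is skewed.
    """
    if value >= mid:
        return (value - mid) / (hi - mid)
    return (value - mid) / (mid - lo)


# Hypothetical minimum (40) and median (70); 75 and 108 are the values quoted on the slide.
LO, MID, HI = 40, 70, 108
for v in (40, 70, 75, 108):
    print(v, round(diverging_position(v, LO, MID, HI), 2))
```

Under these assumptions 75 lands near the neutral midpoint colour while 108 gets the full-strength colour, instead of the two looking almost the same.
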
47. 47. Categorical data Also known as nominal or qualitative. Colours should not form a pattern, as this can imply a false relationship. The ethnicity colours here are reasonable, although quite close in colour space. The location colours are badly chosen. They suggest a linear progression, which is meaningless. http://www.visualizing.org/full-screen/10886
48. 48. Consistency Don’t do this. http://www.raterush.com/pages/digg-reddit
49. 49. Other considerations ● Colour blindness: roughly 8% of men (about 1 in 12), but under 1% of women ● Print and photocopy friendliness ● Characteristics of different screens, esp. projectors. ColorBrewer is a great help: http://colorbrewer2.org/ See also… ● brewer2mpl (Python) ● RColorBrewer (R) ● ColorBrewer (Matlab)
50. 50. Bad dataviz 5. Correlation vs. causation http://xkcd.com/552/
51. 51. Beware of bogus correlations http://gizmodo.com/5977989/internet-explorer-vs-murder-rate-will-be-your-favorite-chart-today/ and http://pubs.acs.org/doi/abs/10.1021/ci700332k Correlation does not prove causation, even with a good R² score.
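
A high R² is easy to get from two quantities that simply both trend in the same direction over time. The sketch below uses two invented series that decline over ten periods; the numbers are made up for illustration and are not the data behind either linked chart.

```python
# Two invented, unrelated series that both happen to decline over ten periods.
series_a = [100, 96, 92, 85, 81, 77, 70, 66, 62, 55]
series_b = [50, 50, 50, 45, 45, 40, 40, 40, 35, 35]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


r = pearson_r(series_a, series_b)
r_squared = r ** 2
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

The shared downward trend alone produces an R² above 0.9, even though nothing connects the two series.
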
52. 52. Beware of bogus correlations Even respectable journals sometimes get carried away. Ask yourself: Are these both effects of a common cause? Or just sheer chance? (Multiple comparisons) http://www.nejm.org/doi/full/10.1056/NEJMon1211064
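
The multiple-comparisons point can be made concrete with a little arithmetic. This sketch assumes the tests are independent and that each uses the same significance threshold, which real analyses rarely satisfy exactly; the function name is illustrative.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20, 100):
    print(f"{n:>3} tests: {prob_any_false_positive(n):.1%}")
```

With 20 independent comparisons at the 5% level, the chance of at least one spurious “significant” result is about 64%, so a striking correlation found after trying many pairings deserves suspicion.
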
53. 53. Bad dataviz 6. Trying to say too much Each visualization needs a clear purpose. But some designers and analysts try to include every possible piece of information. This is not a good idea. Unnecessary detail and ostentatiously “clever” presentation can obscure the real message.
54. 54. http://bit.ly/1gVzeUe Don’t do this.
55. 55. Bad dataviz 7. Tips for developing a critical eye. Here are some techniques you can use for critical analysis. They are often subjective, debatable, context-dependent and partly based on aesthetics… So don’t expect absolute rules.
56. 56. Usability Does the chart need detailed instructions in order for it to be comprehensible and usable? ● Acceptable if this is a standard visualization method used in a particular domain ● Less acceptable if this is a one-off for general consumption
57. 57. First impressions test What is the first thing you infer from looking at the visualization? (Don’t stop to read every detail -- see what you get from a glance.) Does this impression prove to be accurate, on closer inspection? If not, then there may be a problem. Many people will only glance and never perform the close inspection.
58. 58. Return on effort (Kaiser Fung) http://bit.ly/1dKewGo
59. 59. Self-sufficiency test (Kaiser Fung) Would the chart make sense without the numbers printed on each data point? If not, the chart has failed the self-sufficiency test. http://junkcharts.typepad.com/junk_charts/2013/03/blowing-the-whistle-at-bubble-charts.html
60. 60. Trifecta checkup (Kaiser Fung) Ask the following: ● What practical question does the graphic attempt to address? ● What answer does the data imply? ● What answer does the graphic imply? Can you answer these clearly? Do the three answers align? If not, there is something wrong. http://junkcharts.typepad.com/junk_charts/2014/02/pets-may-need-shelter-from-this-terrible-chart.html
61. 61. Data-ink score (Edward Tufte) Main principle: Remove redundant or uninformative elements from the design, to reduce distraction. High data-ink ratio = clarity. http://www.infovis-wiki.net/index.php/Data-Ink_Ratio
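
For reference, the ratio named on this slide is usually written as follows, paraphrasing Tufte’s definition:

```latex
\[
\text{data-ink ratio}
  = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
  = 1 - \bigl(\text{proportion of the graphic that can be erased without loss of data-information}\bigr)
\]
```

The ratio ranges from 0 to 1, and removing decoration, heavy gridlines or redundant labels pushes it towards 1.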
62. 62. And finally… Ask yourself how much you trust the data. Professional presentation does not imply reliable numbers. Is there enough data to be sure of statistical significance? What are the margins of error? Is there a plausible mechanism of action? What about sources of bias (accidental or intentional), confounding factors, missing data, or measurement error (noise)?
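
As a rough sanity check on “is there enough data?”, the margin of error of a reported percentage can be estimated from the sample size. This sketch uses the standard normal approximation for a simple random sample at roughly 95% confidence; it ignores the bias, confounding and measurement problems listed above, and the function name is illustrative.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion p_hat from a simple random sample of size n."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


for n in (100, 1000, 10000):
    print(f"n={n:>5}: +/- {margin_of_error(0.5, n):.1%}")
```

A chart built on 100 responses carries a margin of nearly ten percentage points either way, so differences smaller than that should not be read as real.
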
63. 63. Thank you! http://www.makefive.com/categories/entertainment/other/pie-charts-that-explain-simple-material/percentage-of-chart-which-resembles-pac-man