9. WHEN GOOGLE GOT IT BIG WRONG
Google Flu was a tool meant to predict outbreaks of influenza-like illness (ILI)
It used Google search queries as its input
CDC: Centers for Disease Control and Prevention
The parable of Google Flu: traps in Big Data analysis, Science (2014)
10. WHAT WAS THE PROBLEM?
Two main things:
1. Use of spurious search query data
• queries correlated with flu outbreaks but did not predict them
• no appropriate statistical analysis of biases & patterns
2. Dependency on the inner workings of the search algorithm
• recommended queries inflated the counts
• ongoing changes to the search algorithm altered the experimental conditions
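A minimal sketch of the first point, assuming two made-up weekly series that share nothing except a yearly winter cycle. The names `flu_cases` and `query_volume`, the noise level, and the ten-year span are all hypothetical; this is not the Google Flu data or method. The raw series correlate strongly, and the correlation should largely vanish once the shared seasonal cycle is subtracted.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def seasonal_series(n_weeks, rng, noise=0.3):
    """Yearly cycle peaking at week 0 plus independent Gaussian noise."""
    return [math.cos(2 * math.pi * t / 52) + rng.gauss(0, noise)
            for t in range(n_weeks)]


def remove_seasonality(series, period=52):
    """Subtract the average value observed for each week of the year."""
    week_means = [sum(series[w::period]) / len(series[w::period])
                  for w in range(period)]
    return [v - week_means[t % period] for t, v in enumerate(series)]


rng = random.Random(0)                         # fixed seed, repeatable run
n_weeks = 52 * 10                              # ten synthetic years
flu_cases = seasonal_series(n_weeks, rng)      # hypothetical ILI signal
query_volume = seasonal_series(n_weeks, rng)   # hypothetical unrelated query

raw_r = pearson(flu_cases, query_volume)
deseasonalised_r = pearson(remove_seasonality(flu_cases),
                           remove_seasonality(query_volume))

print(f"correlation of raw series:            {raw_r:.2f}")
print(f"correlation after removing the cycle: {deseasonalised_r:.2f}")
```

Any query that peaks in winter would pass a naive correlation filter in this setup, which is why checking for shared patterns before trusting a correlate matters.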
11. A MASTERPIECE OF MANIPULATION
Plot by S Goddard (T Heller), a climate change denialist who blogs at realclimatescience.com
The data is manipulated:
• only part of the timeline is shown
• no correction for changes in the weather stations
• picks one season and uses only high temperatures
USA temperatures, can I sucker you?, Open Mind, Tamino (blog post)
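A small sketch of how showing only part of a timeline can flip a trend. The series below is synthetic (a steady rise of 0.02 per step plus a 30-step oscillation, both chosen arbitrarily); it is not the temperature record from the plot. A least-squares slope is computed over the full series and over a window picked to run from a peak of the oscillation to the next trough.

```python
import math


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical series: long-term rise plus a slow oscillation.
steps = list(range(120))
values = [0.02 * t + math.sin(2 * math.pi * t / 30) for t in steps]

full_trend = slope(steps, values)

# Cherry-picked window: starts just after a peak of the oscillation
# (t = 7.5) and stops just before the following trough (t = 22.5).
start, end = 8, 23
picked_trend = slope(steps[start:end], values[start:end])

print(f"trend over the whole timeline: {full_trend:+.3f} per step")
print(f"trend over the chosen window:  {picked_trend:+.3f} per step")
```

The full series trends upward while the chosen window trends downward, even though both come from the same numbers.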
13. THE CONFIRMATION BIAS
This is quite common, and easy to spot
It occurs when you see in the data what you want to see or what you already believe
“X crimes committed by immigrants” → “Immigration increases crime rates”
(no stats on non-immigrants considered)
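The missing step can be sketched as a rate comparison. Every number below is invented purely for illustration and carries no empirical meaning; the point is only that a raw count for one group supports no conclusion until it is divided by that group's size and set against the same rate for the comparison group.

```python
# Hypothetical counts, made up for illustration only.
groups = {
    "immigrants":     {"crimes": 50,  "population": 10_000},
    "non-immigrants": {"crimes": 600, "population": 100_000},
}


def rate_per_1000(crimes, population):
    """Crimes per 1000 people in the group."""
    return 1000 * crimes / population


rates = {name: rate_per_1000(g["crimes"], g["population"])
         for name, g in groups.items()}

for name, rate in rates.items():
    print(f"{name:>15}: {rate:.1f} crimes per 1000 people")
```

With other made-up inputs the comparison could go either way; without the second row there is no comparison at all.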
14. SIMPSON’S PARADOX
TOTALS       Me       My friend
WON          10       5
LOST         7        4
Winning %    58.8%    55.6%
MONTHLY      Me, October    My friend, October    Me, November    My friend, November
WON          4              3                     6               2
LOST         6              4                     1               0
Winning %    40%            42.9%                 85.7%           100%
It occurs when a result seen in the aggregated data reverses (or disappears) once the data is split into attribute groups
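The reversal can be checked with a few lines of arithmetic. This sketch uses the monthly win/loss counts from the slide, with one assumption stated plainly: my friend's November record is taken as 2 won, 0 lost, which is the reading consistent with the 100% winning percentage shown. Totals are derived from the monthly counts instead of being typed in.

```python
# (won, lost) per player and month.
records = {
    ("Me", "October"): (4, 6),
    ("My friend", "October"): (3, 4),
    ("Me", "November"): (6, 1),
    ("My friend", "November"): (2, 0),  # assumption: matches the 100% shown
}


def win_pct(won, lost):
    """Winning percentage from a win/loss count."""
    return 100 * won / (won + lost)


# Winning percentage within each month.
monthly = {key: win_pct(*counts) for key, counts in records.items()}

# Aggregate the monthly counts per player, then recompute the percentage.
totals = {}
for (player, _month), (won, lost) in records.items():
    w, l = totals.get(player, (0, 0))
    totals[player] = (w + won, l + lost)
overall = {player: win_pct(*counts) for player, counts in totals.items()}

for (player, month), pct in monthly.items():
    print(f"{player:>9}, {month:<8}: {pct:5.1f}%")
for player, pct in overall.items():
    print(f"{player:>9}, overall : {pct:5.1f}%")
```

My friend has the better percentage in each month, yet I have the better percentage overall, because most of my games fall in the month where both of us did well.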
18. If it’s your work
• inspect your data
• look for your biases
• be strong against your desires
• if there’s correlation, don’t jump the gun
• …
If it’s someone else’s work (e.g., the media)
• look for the sources, be judgmental
• evaluate if data has been cherry-picked
• question the interests behind
• …
19. “The key word in ‘Data Science’ is not Data, it is Science.”
- Jeff Leek
21. SOME GOOD REFERENCES
• D Huff, How to lie with Statistics, W W Norton & Company (1954)
• D Lazer, R Kennedy, G King, A Vespignani, The parable of Google Flu: traps in Big Data analysis, Science 343:6176 (2014)
• R Botsman, Big Data meets Big Brother as China moves to rate its citizens, WIRED (2017)
• C Bergstrom, J West, Calling Bullshit in the age of Big Data (course/videos/material)
• J Leek, B B McShane, A Gelman, D Colquhoun, M B Nuijten, S N Goodman, Five ways to fix Statistics, Nature 551 (2017)
• G Lewis-Kraus, The great AI Awakening, The New York Times (2016)
• H Fry, What Statistics can and can’t tell us about ourselves, The New Yorker (2019)
• P J Bickel, E A Hammel, J W O’Connell, Sex bias in graduate admissions: data from Berkeley, Science 187:4175 (1975)
• T Harford, More or Less: behind the statistics (podcast, BBC Radio 4)
• T Vigen, Spurious Correlations (website)
• The best Hans Rosling talks you’ve ever seen (TED)
• Tamino, USA temperature: can I sucker you?, Open Mind (blog post)