The document discusses common problems that can arise when analyzing data from news stories. It examines six fictional news stories and identifies potential issues with how the data was collected and presented, including small sample sizes, selection bias, outliers skewing averages, unobserved factors, spurious correlations, and lack of context. The key takeaways are to consider how representative samples are, watch out for outliers, understand the sampling method, visualize the data, and provide appropriate context for any numbers reported.
3. Story #1: Best-Burgers moves up-market
Dissect the data in this story:
Best-Burgers Attracts Upper-Income Diners
San Francisco – November 9, 2014 – It’s no secret that
Best-Burgers has been courting upper-income diners, and it looks
like their campaign is working. At lunch yesterday, I visited a
Best-Burgers near our downtown office and chatted with customers
enjoying the daily special: bountiful lobster salads with earthy
pommes frites, paired with a perfect Pouilly-Fumé.
From my 14 conversations with these happy diners, the average
yearly income was $164k, far above the old stereotype of
budget-conscious fast-food customers.
4. Here are some problems with the Best-Burgers story
Tiny sample: Only 14 customers; chance greatly affects the average.
Sample bias: Downtown San Francisco at lunchtime does not represent the USA.
Selection bias: Journalist picked lobster eaters.
Interviewer bias: Journalist may have coached participants: “You look like an upper-income customer, may I ask you a quick question?”
Response bias: Low-income customers might be embarrassed and not answer.
9. Here are counties with the lowest cancer rates
Propose a hypothesis
Wainer, H, et al. Phi Delta Kappan, 300–303, 2006
10. Check this out: Counties with highest cancer rates
What’s going on?
Wainer, H, et al. Phi Delta Kappan, 300–303, 2006
11. Small samples produce high variance
Figure 3. Age-adjusted cancer rate (per hundred thousand) vs. population
[Scatter plot: rate axis from 0 to 20; population axis on a log scale from 100 to 10,000,000]
Wainer, H, et al. Phi Delta Kappan, 300–303, 2006
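The effect on this slide can be reproduced with a small simulation. The sketch below is only an illustration under stated assumptions: every simulated county is given the same true rate (a made-up 20 cases per 100,000), and the county sizes and counts are hypothetical, not taken from Wainer's data. Because the true rate never varies, any spread in the observed rates is pure sampling noise.

```python
import random
import statistics

# Hypothetical setup: every county shares the SAME true rate, so any
# difference between counties is caused by chance alone.
TRUE_RATE = 2e-4      # assumed: 20 cases per 100,000 people
N_COUNTIES = 40       # assumed number of counties in each group


def observed_rates(population, rng):
    """Simulate the observed rate (per 100,000) in N_COUNTIES counties."""
    rates = []
    for _ in range(N_COUNTIES):
        # Each resident independently becomes a case with probability TRUE_RATE.
        cases = sum(1 for _ in range(population) if rng.random() < TRUE_RATE)
        rates.append(cases / population * 100_000)
    return rates


rng = random.Random(0)                  # fixed seed so the run is repeatable
small = observed_rates(5_000, rng)      # small counties
large = observed_rates(100_000, rng)    # large counties

for name, rates in (("small counties", small), ("large counties", large)):
    print(f"{name}: min={min(rates):.0f} max={max(rates):.0f} "
          f"stdev={statistics.stdev(rates):.1f}")
```

In a run like this, both the lowest and the highest observed rates tend to come from the small counties, which is one plausible reading of why small counties appear on both the "lowest" and "highest" lists.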
12. Story #2: Stock portfolios are doing great
Dissect the data in this story:
No Sad Faces as Dow Smashes Record
New York – November 9, 2014 – After Friday’s record stock
market close, analysis of 5000 random investor accounts found that
the average account balance was over $10 million. “Never
before have so many people made so much money,” beamed a
jubilant Ann Smith as crisp $100 bills spilled out of her pockets.
13. Simple histogram reveals the underlying data
[Histogram: number of investors vs. account value in billions of dollars, $0 to $50]
Average Balance = $10,000,000
What could cause this data?
14. Outliers skewed average to $10 million
Most account balances are small
One is huge
Average balance = (total of all account balances) / (5000 accounts) = $10 million
Outlier points are either:
Correct but unusual data
Bad data (errors, typos very common)
Takeaway: Outliers skew results
Takeaway: Always look for outliers
18. Median finds the middle item
1. Rank the account balances from smallest to biggest
2. Pick the middle position
3. This is the median
4. The median is much less sensitive to outliers than the average
Rank Balance
1 $0
2 $14
3 $241
...
...
→ 2500 → $50,251
...
...
4998 $341,032
4999 $965,864
5000 $50,231,754,642
Takeaway: Median tolerates outliers
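A quick way to see the difference is to compute both statistics on the same numbers. The sketch below uses only the seven balances that are visible in the table above, not the full 5,000 accounts, so the results illustrate the mechanism rather than reproduce the slide's figures.

```python
import statistics

# Only the rows visible on the slide; the elided rows are NOT filled in.
balances = [0, 14, 241, 50_251, 341_032, 965_864, 50_231_754_642]

mean_all = statistics.mean(balances)
median_all = statistics.median(balances)

# Drop the single largest balance and recompute.
without_outlier = sorted(balances)[:-1]
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)

print(f"with outlier:    mean=${mean_all:,.0f}  median=${median_all:,.0f}")
print(f"without outlier: mean=${mean_trimmed:,.0f}  median=${median_trimmed:,.0f}")
```

On these rows the mean moves by several orders of magnitude when the one huge account is removed, while the median stays within the range of ordinary balances.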
20. Story #3: Refrigerator prices in deep freeze
Dissect the data in this story:
Refrigerator Prices Stuck in Deep Freeze
Chicago – November 9, 2014 – Median refrigerator prices have
been flat for the past ten years, despite a flood of new high-end
products with luxury styling, celebrity endorsements, and
high-efficiency green technology.
What are some possibilities here?
21. Median condenses complex data into single number
[Two histograms of refrigerators sold vs. unit price (dollars, 0 to 5000): one for 10 years ago, one for the current year. Median = 808 in both.]
22. Graphing told much more of a story than numbers
Takeaway: Summary statistics often hide interesting data
We’ve seen limitations with:
average (mean)
median
You’ll see limitations with other summary statistics:
standard deviation
correlation
regression
Takeaway: Graphing tells a much better story than numbers
23. Story #4: Taller children read better
Dissect the data in this story:
Lanky Bookworms in Spotlight
Washington – November 9, 2014 – The U.S. Department of
Education reported yesterday that reading comprehension for
students in grades 3–8 dramatically corresponded with the
students’ heights.
28. Why is reading score related to height?
[Diagram: Age (not observed) causes Height (observed); Age (not observed) causes Reading (observed)]
Takeaway: Non-observed factors are common.
Always look for underlying causes
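The role of the unobserved factor can be sketched with a simulation. Everything numeric below is an assumption made up for illustration: the ages, the growth model, and the reading-score model are hypothetical. Height and reading score are both driven by age and have no direct link to each other.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(1)          # fixed seed so the run is repeatable
children = []                   # (age, height, reading) tuples
for age in range(8, 14):        # assumed ages for grades 3 to 8
    for _ in range(500):
        height = 100 + 6 * age + rng.gauss(0, 5)    # made-up growth model (cm)
        reading = 10 * age + rng.gauss(0, 10)       # made-up score model
        children.append((age, height, reading))

# Correlation when age is ignored (the "not observed" case).
overall_r = pearson([c[1] for c in children], [c[2] for c in children])

# Correlation inside each age group (age held fixed).
within_r = {}
for age in range(8, 14):
    group = [c for c in children if c[0] == age]
    within_r[age] = pearson([c[1] for c in group], [c[2] for c in group])

print(f"all ages together: r = {overall_r:.2f}")
for age, r in within_r.items():
    print(f"age {age}: r = {r:.2f}")
```

Pooled across ages, height and reading appear strongly related; within any single age group the relationship nearly vanishes, which is what you would expect if age is the common cause.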
30. We also measure correlation (r) between variables
[Figure: scatter plots illustrating correlation values 1, 0.8, 0.4, 0, -0.4, -0.8, -1]
Correlation measures strength of linear relationship:
+1 Perfectly correlated (rare)
Example: Height in inches & Height in cm
-1 Perfectly inversely correlated (rare)
Example: Hours sleeping & Hours awake
0 Non-correlated – no relationship
Example: Favorite food & Purchases of postage stamps
−1 < r < +1 Common – some relationship
Image credit: wikipedia.org
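The two extreme cases can be checked directly. This is a minimal sketch with made-up sample values; the `pearson` helper is written out here for illustration rather than taken from any particular library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up heights: the same quantity measured in two units.
height_in = [60, 64, 68, 72, 76]
height_cm = [h * 2.54 for h in height_in]

# Made-up sleep times: hours awake is always 24 minus hours asleep.
hours_asleep = [5, 6, 7, 8, 9]
hours_awake = [24 - h for h in hours_asleep]

r_height = pearson(height_in, height_cm)
r_sleep = pearson(hours_asleep, hours_awake)

print(f"inches vs. cm:    r = {r_height:+.2f}")
print(f"asleep vs. awake: r = {r_sleep:+.2f}")
```

Both pairs are exact linear functions of each other, one with positive slope and one with negative slope, which is why they sit at the two ends of the scale.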
32. Correlation does not imply causality
[Scatter plot: Nobel laureates per 10 million population (0 to 35) vs. chocolate consumption (kg/yr/capita, 0 to 15); r = 0.791, P < 0.0001. Countries shown: Australia, Austria, Belgium, Brazil, Canada, China, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Japan, The Netherlands, Norway, Poland, Portugal, Spain, Sweden, Switzerland, United Kingdom, United States]
Messerli, FH. N Engl J Med, 367(16):1562, 2012
33. Big Data produces spurious correlations
Marriage rate correlates with electrocutions
24,000 automatically discovered correlations at http://www.tylervigen.com/
34. Big Data produces spurious correlations
Marijuana arrests inversely correlate with honey bee population
24,000 automatically discovered correlations at http://www.tylervigen.com/
Takeaway: Correlation does not imply causality
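One way to see why large collections of variables produce spurious correlations is to search pure noise. The sketch below is an illustration under assumptions: 200 made-up series of 10 random values each, with a threshold of |r| > 0.8 chosen arbitrarily. None of the series has any real relationship to any other.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(42)     # fixed seed so the run is repeatable
N_SERIES = 200              # assumed number of unrelated variables
LENGTH = 10                 # assumed number of yearly observations each

# Independent random noise: no series is related to any other.
series = [[rng.gauss(0, 1) for _ in range(LENGTH)] for _ in range(N_SERIES)]

pairs_checked = 0
strong = []                 # (i, j, r) for pairs with |r| > 0.8
for i, j in itertools.combinations(range(N_SERIES), 2):
    pairs_checked += 1
    r = pearson(series[i], series[j])
    if abs(r) > 0.8:
        strong.append((i, j, r))

print(f"pairs checked: {pairs_checked}")
print(f"pairs with |r| > 0.8: {len(strong)}")
```

With thousands of pairs to compare and only a few observations per series, some pairs line up strongly by chance alone, so a high r found by automated search says little on its own.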
39. Story #5: Happy colors make happy patients
Dissect the data in this story:
Bright colors cheer up hospital patients
Topeka – November 9, 2014 – In a groundbreaking experiment,
Central Hospital has shown that warm, happy colors improve
patients’ moods.
Using two identical general medicine wards, researchers splashed
one with bright perky colors, and slathered the other in a viscous,
dreary, Soviet-era gray. One month later, Dr. Vargas interviewed
100 patients exposed to bright colors, while Dr. Mira interviewed
100 patients surrounded in gloom.
The patients exposed to bright colors were 68% happier than those
from the other ward.
40. Can you find some possible biases here?
[Diagram: Bright paint → Patients → interviewed by Vargas; Gloomy paint → Patients → interviewed by Mira]
41. Story #6: Marketing manager sues firm
Dissect the data in this story:
Fired sales manager James Smith demands compensation
Cambridge, MA – November 9, 2014 – James Smith argued in
Federal Court today that sales increased by 400% while he led the
International Marketing Division, and that he should have been
rewarded rather than terminated.
“Increasing sales by 400% is way beyond superstar performance,”
roared his attorney.
42. Relative change hides quantity
Sales increased 400% = (sales this year – sales last year) / (sales last year)
Sales increased 400% = (5 – 1) / 1
Sales increased 400% = (5,000,000 – 1,000,000) / 1,000,000
43. True story: Contraceptive Pill Scare of 1995
U.K. Committee on Safety of Medicines (1995):
Old contraceptive: 1/7,000 had severe blood clot
New contraceptive: 2/7,000 had severe blood clot
“New drug
doubles risk”
Patients
abandoned drug
Takeaway: Relative change hides quantity
Gigerenzer, G, et al. Psychological science in the public interest, 8(2):53, 2007
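The arithmetic behind the last two slides can be worked through in a few lines. This sketch uses only the numbers shown on the slides; the function and variable names are illustrative.

```python
def relative_change(old, new):
    """Fractional change from old to new; 4.0 means an increase of 400%."""
    return (new - old) / old


# Story #6: the same 400% can describe very different quantities.
small_firm = relative_change(1, 5)
large_firm = relative_change(1_000_000, 5_000_000)
print(f"sales 1 -> 5:                 {small_firm:.0%}")
print(f"sales 1,000,000 -> 5,000,000: {large_firm:.0%}")

# Contraceptive pill scare: relative vs. absolute change in risk.
old_risk = 1 / 7000
new_risk = 2 / 7000
relative_increase = relative_change(old_risk, new_risk)
absolute_increase = new_risk - old_risk
extra_cases_per_7000 = absolute_increase * 7000

print(f"relative increase in risk: {relative_increase:.0%}")
print(f"absolute increase in risk: {absolute_increase:.4%}")
print(f"extra cases per 7,000 patients: {extra_cases_per_7000:.1f}")
```

Reporting the absolute figure alongside the relative one lets the reader judge how much actually changed.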
44. Recap: Visualization tells story better than numbers
All: ȳ = 7.5, S = 2, r = 0.82
Anscombe, FJ. The American Statistician, 27(1):17, 1973
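The summary line can be checked against Anscombe's four datasets. The values below are the quartet as it is commonly tabulated from the 1973 paper; treat them as an assumption and verify against the original before relying on them. The `pearson` helper is written out for illustration.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Anscombe's quartet as commonly tabulated (assumed transcription).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (
        statistics.mean(ys),      # mean of y
        statistics.stdev(ys),     # sample standard deviation of y
        pearson(xs, ys),          # correlation between x and y
    )
    mean_y, sd_y, r = summary[name]
    print(f"{name:>3}: mean(y)={mean_y:.2f}  sd(y)={sd_y:.2f}  r={r:.2f}")
```

The four rows of summary statistics come out nearly identical even though the scatter plots look nothing alike, which is the point of the slide.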
45. We can visualize 3-D and 4-D datasets
Extend to 5-D and 6-D:
Point size:
Point shape: + G X
http://www.advsofteng.com/doc/cdperldoc/threedscatter.htm
47. Visualize and compare numeric data by category
[Plot: highway mileage (0 to 30) by manufacturer, listed alphabetically: Audi, Chevrolet, Dodge, Ford, Honda, Hyundai, Jeep, Land rover, Lincoln, Mercury, Nissan, Pontiac, Subaru, Toyota, Volkswagen]
Takeaway: Alphabetic ordering obscures story
49. Reordering & simplifying greatly clarifies the story
[Dot plot: highway mileage (20 to 30) by manufacturer, ordered by mileage, listed top to bottom: Land rover, Lincoln, Jeep, Dodge, Mercury, Ford, Chevrolet, Nissan, Toyota, Subaru, Pontiac, Audi, Hyundai, Volkswagen, Honda]
Takeaway: Small visualization changes add great clarity to a story
51. Visualize and compare histograms by category
[Two histograms of number of pets vs. number of tricks (0 to 20): one panel for cats (1000), one panel for dogs (1000)]
52. Visualized cross-tabulated data
Student Admissions at UC Berkeley in 1973
Gender   Admitted   Rejected
Male     1198       1493
Female   557        1278
[Mosaic plot: Admitted / Rejected by Male / Female]
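Raw counts in a cross-tabulation are hard to compare because the row totals differ. A minimal sketch that converts the table above into admission rates per row (the variable names are illustrative):

```python
# Counts copied from the cross-tabulation above.
admissions = {
    "Male": {"admitted": 1198, "rejected": 1493},
    "Female": {"admitted": 557, "rejected": 1278},
}

rates = {}
for gender, counts in admissions.items():
    applicants = counts["admitted"] + counts["rejected"]
    rates[gender] = counts["admitted"] / applicants
    print(f"{gender:<6} applicants={applicants:>5}  admitted={rates[gender]:.1%}")
```

These aggregate rates describe only the table as given; they say nothing about why the rates differ, which is where the earlier takeaway about unobserved factors applies.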
53. Let’s summarize
Our broad philosophy:
Always think carefully about data (brain software)
Always explore data
Visualizing data is extremely valuable
Data often contains noise and bias
Summary statistics (mean, median, correlation, ...) obscure important details
Correlation does not imply cause
Big Data increases spurious correlations