
# Three Stats Pitfalls Facing the New Data Scientist with Sean Owen


Careful with that modeling tool! Even the simplest data analysis problems can have surprising statistical subtleties, which can lead the aspiring data scientist to the wrong conclusions from data. This talk will examine three straightforward scenarios where many answers seem correct. It will examine how the notion of causality helps resolve all of them, and briefly explore the power of graphical models and Judea Pearl’s do-calculus.

By the end of this session, you will be more careful with your modeling tools and will understand that correlation is not always causation.


### Three Stats Pitfalls Facing the New Data Scientist with Sean Owen

1. Three Stats Pitfalls Facing the New Data Scientist. Sean Owen

2. Do I Know You?
   - Apache Spark committer, PMC
   - “Advanced Analytics with Spark”
   - Recently: Director, Data Science @ Cloudera

3. Correlation is not causation. But then, what is causation?

4. Which Best-Fit Line Is Best?
   - `lm(speed ~ dist, cars)`
   - `lm(dist ~ speed, cars)`

5. Which Treatment is Better?

   |              | Treatment A   | Treatment B   |
   |--------------|---------------|---------------|
   | Small Stones | 93% (81/87)   | 87% (234/270) |
   | Large Stones | 73% (192/263) | 69% (55/80)   |
   | Overall      | 78% (273/350) | 83% (289/350) |

6. Now, Which Treatment is Better?

   |               | Treatment A   | Treatment B   |
   |---------------|---------------|---------------|
   | Low Blood pH  | 93% (81/87)   | 87% (234/270) |
   | High Blood pH | 73% (192/263) | 69% (55/80)   |
   | Overall       | 78% (273/350) | 83% (289/350) |

7. Carburetors and Axle Ratio Uncorrelated: `lm(drat ~ carb, mtcars)`, r = -0.09

8. … Except in Both Halves of the Data?
   - `lm(drat ~ carb, mtcars[which(mtcars$cyl >= 6), ])`, r = 0.52
   - `lm(drat ~ carb, mtcars[which(mtcars$cyl < 6), ])`, r = 0.22

10. Resolution: Causation
    - Humans reason causally
    - Data doesn’t contain causal information
    - Data correlations consistent with multiple causal models
    - Correct inference requires adding causal model

11. One Consistent with Causal Knowledge: $\text{dist}_i = \beta_0 + \beta_1\,\text{speed}_i + \epsilon_i$

12. (Causal diagram; nodes: Distance, Speed)

13. Controlling Confounders is Right

    |              | Treatment A   | Treatment B   |
    |--------------|---------------|---------------|
    | Small Stones | 93% (81/87)   | 87% (234/270) |
    | Large Stones | 73% (192/263) | 69% (55/80)   |
    | Overall      | 78% (273/350) | 83% (289/350) |

14. (Causal diagram; nodes: Outcome, Treatment, Size)

15. Controlling Mediators is Wrong

    |               | Treatment A   | Treatment B   |
    |---------------|---------------|---------------|
    | Low Blood pH  | 93% (81/87)   | 87% (234/270) |
    | High Blood pH | 73% (192/263) | 69% (55/80)   |
    | Overall       | 78% (273/350) | 83% (289/350) |

16. (Causal diagram; nodes: Outcome, Treatment, pH)

17. Colliders Create Correlation: `lm(drat ~ carb, mtcars)`, r = -0.09

18. (Causal diagram; nodes: Carburetors, Axle Ratio, Cylinders)

19. Causation and do-Calculus

20. do-Calculus: $P(Y \mid X) \neq P(Y \mid \mathrm{do}(X))$. Just because it’s more often raining when you walk outside with an umbrella … doesn’t mean that you carrying an umbrella makes it more likely to be raining.

21. (Causal diagram; nodes: Cancer, Smoking, Genes?, Tar)

22. Derivation:

    $$
    \begin{aligned}
    P(C \mid \mathrm{do}(S)) &= \sum_t P(C \mid \mathrm{do}(S), t)\, P(t \mid \mathrm{do}(S)) \\
    &= \sum_t P(C \mid \mathrm{do}(S), \mathrm{do}(t))\, P(t \mid \mathrm{do}(S)) \\
    &= \sum_t P(C \mid \mathrm{do}(S), \mathrm{do}(t))\, P(t \mid S) \\
    &= \sum_t P(C \mid \mathrm{do}(t))\, P(t \mid S) \\
    &= \sum_{s'} \sum_t P(C \mid \mathrm{do}(t), s')\, P(s' \mid \mathrm{do}(t))\, P(t \mid S) \\
    &= \sum_{s'} \sum_t P(C \mid t, s')\, P(s' \mid \mathrm{do}(t))\, P(t \mid S) \\
    &= \sum_{s'} \sum_t P(C \mid t, s')\, P(s')\, P(t \mid S)
    \end{aligned}
    $$

23. Conclusion
    - Must bring causal info to data for proper interpretation
    - Know common causal pitfalls!
    - PGMs help reason about causal effects
    - Do-calculus can clarify reasoning about intervention

24. Thank You @sean_r_owen
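
The treatment table on slides 5 and 13 can be checked numerically. The sketch below is written in Python rather than the R used on the slides, and it is an illustration, not code from the talk. The names `results`, `success_rate`, and `adjusted_rate` are hypothetical. The adjustment step assumes stone size is a confounder, as slide 13 states, and applies the standard back-door formula: each stratum's success rate is weighted by that stratum's share of all patients.

```python
# (successes, patients) per treatment and stone size, copied from the slide table.
results = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}


def success_rate(treatment, size=None):
    """Success rate for a treatment in one stratum, or pooled when size is None."""
    cells = [
        counts
        for (t, s), counts in results.items()
        if t == treatment and (size is None or s == size)
    ]
    successes = sum(c[0] for c in cells)
    patients = sum(c[1] for c in cells)
    return successes / patients


def adjusted_rate(treatment):
    """Assumed back-door adjustment: sum over sizes of P(success | treatment, size) * P(size)."""
    total = sum(n for _, n in results.values())
    rate = 0.0
    for size in ("small", "large"):
        # Patients in this stratum across both treatments.
        stratum_n = sum(n for (_, s), (_, n) in results.items() if s == size)
        rate += success_rate(treatment, size) * stratum_n / total
    return rate


for t in ("A", "B"):
    print(
        t,
        f"small={success_rate(t, 'small'):.2f}",
        f"large={success_rate(t, 'large'):.2f}",
        f"pooled={success_rate(t):.2f}",
        f"adjusted={adjusted_rate(t):.2f}",
    )
```

The pooled rates favor Treatment B, while both per-stratum rates and the size-adjusted rate favor Treatment A. That reversal is the one the slides point to. The same adjustment would be misleading for the blood pH table if pH is a mediator rather than a confounder, as slide 15 argues.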