Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
What to Upload to SlideShare
Loading in …3
×
1 of 24

Three Stats Pitfalls Facing the New Data Scientist with Sean Owen

0

Share

Download to read offline

Careful with that modeling tool! Even the simplest data analysis problems can have surprising statistical subtleties, which can lead the aspiring data scientist to the wrong conclusions from data. This talk will examine three straightforward scenarios where many answers seem correct. It will examine how the notion of causality helps resolve all of them, and briefly explore the power of graphical models and Judea Pearl’s do-calculus.

By the end of this session, you will be more cautious and careful with the modeling tool, and learn that correlation is not always causation.

Three Stats Pitfalls Facing the New Data Scientist with Sean Owen

  1. 1. Three Stats Pitfalls Facing 
 the New Data Scientist Sean Owen
  2. 2. 2 Do I Know You? • Apache Spark committer, PMC
 • “Advanced Analytics with Spark”
 • Recently: Director, Data Science @ Cloudera
  3. 3. 3 Correlation is not causation. But then, what is causation?
  4. 4. 4 Which Best-Fit Line Is Best? lm(speed ~ dist, cars) lm(dist ~ speed, cars)
  5. 5. 5 Which Treatment is Better? Treatment A Treatment B Small Stones 93% (81/87) 87% (234/270) Large Stones 73% (192/263) 69% (55/80) 78% (273/350) 83% (289/350)
  6. 6. 6 Now, Which Treatment is Better? Treatment A Treatment B Low Blood pH 93% (81/87) 87% (234/270) High Blood pH 73% (192/263) 69% (55/80) 78% (273/350) 83% (289/350)
  7. 7. 7 Carburetors and Axle Ratio Uncorrelated lm(drat ~ carb, mtcars) r = -0.09
  8. 8. 8 … Except in Both Halves of the Data? lm(drat ~ carb, mtcars[
 which(mtcars$cyl >= 6),]) lm(drat ~ carb, mtcars[
 which(mtcars$cyl < 6),]) r = 0.52 r = 0.22
  9. 9. 9 3 Answers
  10. 10. Resolution: Causation • Humans reason causally • Data doesn’t contain causal information • Data correlations consistent with multiple causal models • Correct inference requires adding causal model 10
  11. 11. 11 One Consistent with Causal Knowledge disti = β0 + β1speedi + ϵi
  12. 12. 12 DistanceSpeed
  13. 13. 13 Controlling Confounders is Right Treatment A Treatment B Small Stones 93% (81/87) 87% (234/270) Large Stones 73% (192/263) 69% (55/80) 78% (273/350) 83% (289/350)
  14. 14. 14 OutcomeTreatment Size
  15. 15. 15 Controlling Mediators is Wrong Treatment A Treatment B Low Blood pH 93% (81/87) 87% (234/270) High Blood pH 73% (192/263) 69% (55/80) 78% (273/350) 83% (289/350)
  16. 16. 16 OutcomeTreatment pH
  17. 17. 17 Colliders Create Correlation lm(drat ~ carb, mtcars) r = -0.09
  18. 18. 18 CarburetorsAxle Ratio Cylinders
  19. 19. 19 Causation and do-Calculus
  20. 20. 20 do-Calculus P(Y|X) ≠ P(Y|do(X)) Just because it’s more often raining when you
 walk outside with an umbrella …
 … doesn’t mean that you carrying an umbrella 
 makes it more likely to be raining.
  21. 21. 21 CancerSmoking Genes? Tar
  22. 22. 22 P(C|do(S)) = ∑ t P(C|do(S), t)P(t|do(S)) = ∑ t P(C|do(S), do(t))P(t|do(S)) = ∑ t P(C|do(S), do(t))P(t|S) = ∑ t P(C|do(t))P(t|S) = ∑ s′ ∑ t P(C|do(t), s′)P(s′|do(t))P(t|S) = ∑ s′ ∑ t P(C|t, s′)P(s′|do(t))P(t|S) = ∑ s′ ∑ t P(C|t, s′)P(s′)P(t|S)
  23. 23. Conclusion • Must bring causal info to data for proper interpretation • Know common causal pitfalls! • PGMs help reason about causal effects • Do-calculus can clarify reasoning about intervention 23
  24. 24. Thank You @sean_r_owen

×