Why You Should Care About Statistics - Jeff Leek



  1. why you should care about statistics: Jeff Leek, Johns Hopkins Bloomberg Biostatistics (@leekgroup, jtleek@gmail.com, @simplystats)
  2. credits
     • slides shamelessly borrowed from:
       – Ingo Ruczinski (Google: “ingo’s pond”)
       – Josh Akey (UW Genomics)
       – Karl Broman (Google: “the stupidest thing broman”)
  3. why this stuff matters
  4. seems like an exciting result! http://www.nature.com/nm/journal/v12/n11/full/nm1491.html
  5. stunning problems
  6. how it went down http://www.nature.com/news/2011/110111/full/469139a/box/1.html
  7. still going on
  8. worth a watch http://www.birs.ca/events/2013/5-day-workshops/13w5083/videos/watch/201308141121-Baggerly.mp4
  9. worth a read http://www.iom.edu/Reports/2012/Evolution-of-Translational-Omics.aspx
  10. what were the problems?
      Transparency:
      • irreproducibility
      • lack of cooperation
      Expertise:
      • silly prediction rules
      • study design/batch effects
      • procedures not locked down
  11. tip #1: know the analysis http://bit.ly/OgW3xv
  12. tip #2: care about the analysis (Drinkel et al. Organometallics 2013)
  13. tip #3: have a data/analysis sharing plan http://www.nature.com/nature/journal/v467/n7314/full/467401b.html
  14. tip #4: know where to get help http://www.biostat.jhsph.edu/consult/
  15. tip #5: no substitute for the real thing
  16. “central dogma” of statistics (adapted from Josh Akey)
  17. sample size
  18. some experiment
  19. example calculations
  20. better technology ≠ no variability http://www.nature.com/nbt/journal/v29/n7/full/nbt.1910.html
  21. power
  22. bad study design: 78% of genes differentially expressed
  23. group and date “confounded”
  24. uh-oh!
  25. confounding: association between shoe size and literacy in kids
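The shoe size and literacy example can be simulated. What follows is only a sketch: it assumes that age is the lurking variable (the slide does not name it), the effect sizes are made up, and `pearson` and `residuals` are illustrative helpers rather than anything from the talk.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """Residuals of y after fitting a least-squares line on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(0)
# Assumed confounder: age in years (hypothetical range).
age = [rng.uniform(5, 12) for _ in range(2000)]
# Both outcomes depend on age only, never on each other.
shoe = [0.8 * a + rng.gauss(0, 1) for a in age]
literacy = [10 * a + rng.gauss(0, 10) for a in age]

raw = pearson(shoe, literacy)
adjusted = pearson(residuals(shoe, age), residuals(literacy, age))
print(f"raw correlation: {raw:.2f}")
print(f"correlation after removing age: {adjusted:.2f}")
```

In this setup the raw correlation is strong even though neither variable influences the other, and it largely disappears once the age trend is removed from both.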
  26. proteomics
  27. proteomics
  28. gene expression
  29. gene expression
  30. gwas
  31. gwas
  32. confounding is a big deal http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html
  33. confounding and study design
  34. tip #6: randomization
  35. an example study
  36. a bad design
  37. stratified design
  38. more good study characteristics
      • Balanced
      • Replicated
      • Has Controls
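One reading of the randomization and stratified-design slides is to assign groups at random within each stratum, so that group is balanced across strata instead of confounded with them. The sketch below assumes that reading; the sample names, the processing-day strata, and the function `stratified_assignment` are all hypothetical.

```python
import random
from collections import Counter

def stratified_assignment(samples, stratum_of,
                          groups=("treatment", "control"), seed=0):
    """Randomly assign groups within each stratum, as evenly as possible."""
    rng = random.Random(seed)
    strata = {}
    for s in samples:
        strata.setdefault(stratum_of[s], []).append(s)
    assignment = {}
    for members in strata.values():
        shuffled = members[:]
        rng.shuffle(shuffled)            # random order within the stratum
        for i, s in enumerate(shuffled):
            assignment[s] = groups[i % len(groups)]  # alternate groups
    return assignment

# Hypothetical study: 12 samples processed on two different days.
samples = [f"sample{i}" for i in range(12)]
stratum_of = {s: ("day1" if i < 6 else "day2") for i, s in enumerate(samples)}

assignment = stratified_assignment(samples, stratum_of)
counts = Counter((stratum_of[s], assignment[s]) for s in samples)
for key in sorted(counts):
    print(key, counts[key])
```

Each day ends up with the same number of treatment and control samples, so a day effect cannot masquerade as a group effect.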
  39. tip #7: look at the data http://en.wikipedia.org/wiki/Anscombe's_quartet
  40. summarizing data http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/
  41. replicates
  42. watch the scale!
  43. log transform is common/useful
  44. bland-altman plots http://en.wikipedia.org/wiki/Bland%E2%80%93Altman_plot
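The quantities behind a Bland-Altman plot are simple to compute: for paired measurements, plot the difference against the mean, and mark the mean difference plus or minus 1.96 standard deviations. This is a minimal sketch with made-up measurements; `bland_altman` is an illustrative name, and plotting is left out.

```python
from statistics import mean, stdev

def bland_altman(a, b):
    """Return per-pair means, per-pair differences, bias, and limits of agreement."""
    means = [(x + y) / 2 for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)                    # average disagreement
    sd = stdev(diffs)                     # spread of the disagreement
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits

# Hypothetical paired measurements from two methods.
method_a = [10.0, 12.0, 14.0, 16.0]
method_b = [9.0, 13.0, 13.0, 17.0]

means, diffs, bias, limits = bland_altman(method_a, method_b)
print("means:", means)
print("differences:", diffs)
print("bias:", bias)
print("limits of agreement:", limits)
```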
  45. beware ridiculograms!
  46. ack! math!
      Observations: $X_1, \ldots, X_M$ and $Y_1, \ldots, Y_N$
      Averages: $\bar{X} = \frac{1}{M} \sum_{i=1}^{M} X_i$ and $\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$
      SD² or variances: $s_X^2 = \frac{1}{M-1} \sum_{i=1}^{M} (X_i - \bar{X})^2$ and $s_Y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2$
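The averages and variances on the "ack! math!" slide translate directly into code. This is a small sketch with made-up observations; the function names are illustrative.

```python
def sample_mean(xs):
    """Average: sum of the observations divided by their count."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Variance with the M - 1 denominator used on the slide."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

x = [2.0, 4.0, 4.0, 6.0]   # hypothetical observations X_1, ..., X_M
print("mean:", sample_mean(x))
print("variance:", sample_variance(x))
print("SD:", sample_variance(x) ** 0.5)
```

Python's standard library provides the same quantities as `statistics.mean`, `statistics.variance`, and `statistics.stdev`.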
  47. an important issue
  48. t-statistic: you’ll see this a lot*
      $t = \frac{\bar{Y} - \bar{X}}{\sqrt{\frac{s_Y^2}{N} + \frac{s_X^2}{M}}}$
      *Invented to improve beer: http://en.wikipedia.org/wiki/Student's_t-test
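The t-statistic on this slide, the difference in averages divided by the square root of the summed per-group variance terms, can be computed as below. The data are made up and `t_statistic` is an illustrative name; this is the unequal-variance form that the formula shows, not a full test with degrees of freedom and a p-value.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(y, x):
    """(mean(Y) - mean(X)) / sqrt(s_Y^2 / N + s_X^2 / M)."""
    n, m = len(y), len(x)
    return (mean(y) - mean(x)) / sqrt(variance(y) / n + variance(x) / m)

x = [1.0, 2.0, 3.0, 4.0]   # hypothetical group X (M = 4)
y = [3.0, 4.0, 5.0, 6.0]   # hypothetical group Y (N = 4)
t = t_statistic(y, x)
print(f"t = {t:.3f}")
```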
  49. p-values: Original Statistic
  50. how to calculate
      Observed Statistic = 2
      P-value = $\frac{\#\{|S_{perm}| \ge |S_{obs}|\}}{\#\text{ of Permutations}}$
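The permutation p-value on this slide is the fraction of permuted statistics at least as extreme as the observed one. The sketch below assumes the statistic S is a difference in group means (the slide does not say which statistic it uses) and runs on made-up data.

```python
import random
from statistics import mean

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Fraction of label shuffles with |S_perm| >= |S_obs|."""
    rng = random.Random(seed)
    s_obs = mean(y) - mean(x)             # assumed statistic: mean difference
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)               # reassign group labels at random
        s_perm = mean(pooled[len(x):]) - mean(pooled[:len(x)])
        if abs(s_perm) >= abs(s_obs):
            count += 1
    return count / n_perm

x = [1, 2, 3, 4, 5]        # hypothetical group 1
y = [6, 7, 8, 9, 10]       # hypothetical group 2
p = permutation_p_value(x, y)
print(f"permutation p-value: {p:.4f}")
```

With random shuffles the result is an estimate of the exact permutation p-value, so it varies slightly with the seed and the number of permutations.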
  51. tip #8: know what a p-value is(n’t)
      The probability of observing a statistic that extreme if the null hypothesis is true.
      The p-value is not
      • Probability the null is true
      • Probability the alternative is true
      • A measure of statistical evidence
  52. an easy mistake to make
  53. a problem
  54. a problem
  55. a problem
  56. multiple comparison error rates
      • Family wise error rate: $\Pr(\#\text{ False Positives} \ge 1)$
      • False discovery rate: $E\left[\frac{\#\text{ False Positives}}{\#\text{ of Discoveries}}\right]$
      • EFP (e-values): $E[\#\text{ False Positives}]$
  57. difference in interpretation
      Suppose 550 out of 10,000 genes are significant at 0.05 level
      • P-value < 0.05: Expect 0.05*10,000 = 500 false positives
      • False Discovery Rate < 0.05: Expect 0.05*550 = 27.5 false positives
      • Family Wise Error Rate < 0.05: The probability of at least 1 false positive ≤ 0.05
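The arithmetic on this slide can be checked with a short script. The simulation at the end is an added illustration, not part of the talk: it assumes every gene is null, so p-values are uniform on [0, 1], which is the setting where about 500 of 10,000 tests fall below 0.05 by chance.

```python
import random

n_genes = 10_000
n_significant = 550
alpha = 0.05

# P-value cutoff: alpha applies to every test performed.
expected_fp_pvalue = alpha * n_genes
# FDR cutoff: alpha applies to the discoveries only.
expected_fp_fdr = alpha * n_significant
print("expected false positives at p < 0.05:", expected_fp_pvalue)
print("expected false positives at FDR < 0.05:", expected_fp_fdr)

# Assumed all-null scenario: uniform p-values, no real signal anywhere.
rng = random.Random(0)
null_p_values = [rng.random() for _ in range(n_genes)]
n_false_positives = sum(p < alpha for p in null_p_values)
print("simulated null genes with p < 0.05:", n_false_positives)
```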
  58. read this http://www.pnas.org/content/100/16/9440.long
  59. the inevitable http://simplystatistics.org/2013/08/26/statistics-meme-sad-p-value-bear/
  60. why I’m sympathetic
  61. beware of “hacking” statistics
  62. be nice to the poor statistician
  63. tip #9: correlation and causation http://xkcd.com/552/
  64. most common mistake
      Fit regression models (correlations) followed by:
      “In summary, our results support a causal relationship of breastfeeding in infancy with receptive language at age 3 and with verbal and nonverbal IQ at school age. These findings support national and international recommendations to promote exclusive breastfeeding through age 6 months and continuation of breastfeeding through at least age 1 year.”
  65. prediction and association
  66. diagnostics
  67. tip #10: know these quantities
  68. key quantities as fractions
  69. important to keep in mind
  70. general population
  71. general population
  72. at risk subpopulation
  73. at risk subpopulation
  74. summary of tips
      1. know the analysis
      2. care about the analysis
      3. have a data sharing plan
      4. know where/when to get help
      5. this isn’t a substitute for learning statistics
      6. randomize in your study design
      7. look at your data
      8. know what p-values are(n’t)
      9. beware causality creep
      10. know the key diagnostic quantities
