10 things statistics taught us about big data

Talk at DC Business Intelligentsia.

Published in: Data & Analytics

  1. 10 things statistics taught us about big data
  2. Research Blogging Teaching
  3. Research Blogging Teaching jtleek.com
  4. Research Blogging Teaching simplystatistics.org
  5. Research Blogging Teaching jhudatascience.org
  6. from: jtleek@gmail.com Roger let me know you gave him a ballpark figure for the number of students registered for his course "Computing for Data Analysis". Could you give me an idea of how many have registered for my course "Data Analysis"?
  7. from: pangwei@coursera.org Hi Jeff, 7,000 students! It's pretty awesome. (You'll be able to check this out yourself next week, once the class sites are up.)
  8. from: rdpeng@gmail.com You are f**ed. -roger
  9. Enrollment Time
  10. Enrollment Time
  11. Enrollment Time
  12. 9 classes 1 month long Every month
  13. Enrollment Time
  14. 1,000,000+ Enrolled
  15. http://goo.gl/vQK0RH
  16. http://goo.gl/xWAlPi
  17. 10 statistics things 1. Problem first, not solution backward 2. Define a metric for success first 3. Analyze interactively 4. Plot your data first and always 5. Know your real sample size 6. Watch out for confounders 7. Correct for multiple testing 8. Average many predictors 9. Smooth over time and space 10. Have others check your work http://goo.gl/wTAuvR
  18. Problem first Not solution backward
  19. http://goo.gl/3vA1OB
  20. http://hyperboleandahalf.blogspot.com/
  21. http://cran.r-project.org//
  22. http://bioconductor.org/
  23. Define a metric for success Before you start
  24. http://www.agendia.com/managed-care/breast-cancer/mammaprint/
  25. 89% sensitivity 42% specificity 65% accuracy
  26. http://www.biomedcentral.com/1471-2164/14/336/figure/F3
  27. Analyze Interactively
  28. http://had.co.nz/
  29. https://twitter.com/EllieMcDonagh/status/469184554549248000
  30. http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
  31. Plot your data First and always
  32. http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
  33. http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/
  34. Know your real sample size
  35. Watch out for confounders
  36. http://xkcd.com/552/
  37. shoe size & literacy
  38. Correct for multiple testing
  39. http://xkcd.com/882/
  40. http://xkcd.com/882/
  41. http://xkcd.com/882/
  42. Average many predictors
  43. 5 independent, 70% accurate classifiers 10 (.7^3)(.3^2)+5(.7^4)(.3)+(.7^5)= 83.7% accuracy http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/EnsembleMethods.pdf Adapted from Todd Halloway
  44. 101 independent, 70% accurate classifiers 99.9% accuracy http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/EnsembleMethods.pdf Adapted from Todd Halloway
  45. http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/EnsembleMethods.pdf Adapted from Todd Halloway
  46. Smooth (average) over time and space
  47. http://simplystatistics.org/2014/02/13/loess-explained-in-a-gif/
  48. http://fivethirtyeight.com/
  49. Have others check your work
  50. 10 statistics things 1. Problem first, not solution backward 2. Define a metric for success first 3. Analyze interactively 4. Plot your data first and always 5. Know your real sample size 6. Watch out for confounders 7. Correct for multiple testing 8. Average many predictors 9. Smooth over time and space 10. Have others check your work http://goo.gl/wTAuvR
  51. jtleek.com/talks
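
The "Average many predictors" slides give the accuracy of a majority vote over independent classifiers that are each 70% accurate. A minimal sketch of that arithmetic follows, assuming the slides mean a simple majority vote with equally accurate, fully independent classifiers. The function name `majority_vote_accuracy` is hypothetical and not from the talk.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of independent classifiers is correct.

    Assumes every classifier is right with the same probability p_correct
    and that their errors are independent (the assumption stated on the slide).
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers so ties cannot occur")
    needed = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    # Sum the binomial probabilities of k correct votes for k = needed..n.
    return sum(
        comb(n_classifiers, k)
        * p_correct**k
        * (1 - p_correct) ** (n_classifiers - k)
        for k in range(needed, n_classifiers + 1)
    )


if __name__ == "__main__":
    # Five classifiers: 10(.7^3)(.3^2) + 5(.7^4)(.3) + (.7^5), as on the slide.
    print(f"{majority_vote_accuracy(5, 0.7):.3f}")  # prints 0.837
    # 101 classifiers: the slide reports 99.9% accuracy.
    print(majority_vote_accuracy(101, 0.7) > 0.999)
```

The gain depends entirely on the independence assumption. Classifiers trained on the same data tend to make correlated errors, so real ensembles usually improve less than this calculation suggests.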
