Data science as a science

A talk on data science as a science, given by Jeff Leek at JSM 2016.

Published in: Science

  1. 1. Evidence based data analysis @jtleek
  2. 2. Data science as a Science (DSaaS) @jtleek
  3. 3. “Data science is as much art as it is science.”
  4. 4. “Wouldn’t it be amazing if we got 2,000 people to learn statistics!” -Jeff Leek, 7/17/12
  5. 5. date: 7/19/12 from: jtleek@gmail.com Roger let me know you gave him a ballpark figure for the number of students registered for his course “Computing for Data Analysis”. Could you give me an idea of how many have registered for my course “Data Analysis”?
  6. 6. date: 7/19/12 from: pangwei@coursera.org Hi Jeff, 7,000 students! It's pretty awesome. (You'll be able to check this out yourself next week, once the class sites are up.)
  7. 7. date: 7/19/12 from: rdpeng@gmail.com You are f**ed. -roger
  8. 8. 9 classes 1 month long Always open
  9. 9. Data Science Specialization Total Enrollments: 3,815,890 Total Completions: 409,712 Genomic Data Science Specialization Total Enrollments: 173,495 Total Completions: 10,826 Executive Data Science Specialization Total Enrollments: 62,076
  10. 10. A theoretical model Data
  11. 11. A theoretical model Data
  12. 12. Y = some outcome X = some covariate D = (X,Y) lm(Y ~ X)
  13. 13. Y = some outcome X = some covariate D = (X,Y) lm(Y ~ X) Leek and Peng, Nature 2015
  14. 14. F0 ⊆ F0(S) ⊆ F0(Y) Fithian, Sun and Taylor arXiv 2015
  15. 15. σ-algebra “what we know” F0 ⊆ F0(S) ⊆ F0(Y)
  16. 16. “we’ve done nothing” F0 ⊆ F0(S) ⊆ F0(Y)
  17. 17. “we did model selection” F0 ⊆ F0(S) ⊆ F0(Y)
  18. 18. “we looked at all the data” F0 ⊆ F0(S) ⊆ F0(Y)
  19. 19. E[β |F0] ≠ E[β |F0(S)]
  20. 20. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. Patil, Peng and Leek bioRxiv 2016
  21. 21. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. Patil, Peng and Leek bioRxiv 2016. F0 ⊆ F0(1_{P,Q}(H)) ⊆ F0(1_{ED}(E)) ⊆ F0(1_{ED;E}(D)) ⊆ F0(1_{AP;A}(C)) ⊆ F0(1_{C}(A*))
  22. 22. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. Patil, Peng and Leek bioRxiv 2016. F0 ⊆ F0(1_{P,Q}(H)) ⊆ F0(1_{ED}(E)) ⊆ F0(1_{ED;E}(D)) ⊆ F0(1_{AP;A}(C)) ⊆ F0(1_{C}(A*))
  23. 23. A theoretical model Data
  24. 24. Slide courtesy Hadley Wickham
  25. 25. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham
  26. 26. Who? What? When? Why? Where? How? Where Ingo is working
  27. 27. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham. Base R; Lasso; dplyr; googlesheets; ppt
  28. 28. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham. Bad life choices?; Sparsity!; David Robinson told me; Spreadsheets; Hegemony
  29. 29. Cleveland and McGill JASA 1984
  30. 30. Leek and Peng, PNAS 2015
  31. 31. Experiment
  32. 32. Leek and Peng, Science 2015
  33. 33. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. E[S | F0(1_{c}(W))]
  34. 34. We take a random sample of individuals in a population and identify whether they smoke and whether they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked and whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population.
  35. 35. 79% 17% Inferential vs Causal n=47,141
  36. 36. We take a random sample of individuals in a population and identify whether they smoke and whether they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked and whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population. We explain that we think the reason for this relationship is that cigarette smoke contains known carcinogens such as benzene, which make cells in the lungs become cancerous.
  37. 37. 65% 32% Inferential vs Causal n=47,141
  38. 38. Experiment
  39. 39. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. E[Est | F0(1_{c}(A))]
  40. 40. 69% vs 40% n=1,985
  41. 41. Experiment
  42. 42. E[Claim | F0(1_{set(base)}(A))] - E[Claim | F0(1_{set(ggplot2)}(A))]. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim
  43. 43. 1. Make a plot that answers the question: what is the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York? 2. Make a plot (possibly multi-panel) that answers the question: how does the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) vary by medical condition (DRG.Definition) and the state in which care was received (Provider.State)? Use only the [ggplot2/base R] graphics system (not base R or lattice) to make your figure.
  44. 44. “Does the plot clearly show the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?” G: 5/22 (23%) vs. B: 5/12 (42%)
  45. 45. “Does the plot clearly show how the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) varies by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?” G: 7/22 (32%) vs. B: 5/12 (42%)
  46. 46. “Is the plot visually pleasing?” G: 21/22 (95%) vs. B: 10/12 (83%); G: 20/22 (91%) vs. B: 8/12 (67%)
  47. 47. “Do the plot text and labels use full words instead of abbreviations?” G: 21/22 (95%) vs. B: 12/12 (100%); G: 11/22 (50%) vs. B: 5/12 (42%)
  48. 48. A theoretical model Data
  49. 49. Data science as a Science (DSaaS) @jtleek
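
Slide 19 states E[β |F0] ≠ E[β |F0(S)]: what we expect from an estimate changes once we condition on having done model selection. The block below is a minimal simulation sketch of that idea in Python. It is not from the talk. It assumes a toy setup in which every candidate effect is truly zero and each estimate is a plain sample mean of Gaussian noise; the function name `simulate` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def simulate(n_reps=1000, n_candidates=10, n_obs=25, seed=0):
    """Compare a pre-specified estimate with one picked after seeing the data.

    Assumption (not from the slides): every candidate effect is truly zero,
    so any systematic shift away from zero comes from selection alone.
    """
    rng = random.Random(seed)
    prespecified_total = 0.0
    selected_total = 0.0
    for _ in range(n_reps):
        # One noisy estimate (a sample mean) per candidate covariate.
        estimates = [
            sum(rng.gauss(0.0, 1.0) for _ in range(n_obs)) / n_obs
            for _ in range(n_candidates)
        ]
        # "We've done nothing": the candidate was fixed before seeing the data.
        prespecified_total += estimates[0]
        # "We did model selection": report whichever candidate looked largest.
        selected_total += max(estimates)
    return prespecified_total / n_reps, selected_total / n_reps


mean_prespecified, mean_selected = simulate()
print(f"average pre-specified estimate: {mean_prespecified:+.3f}")
print(f"average selected estimate:      {mean_selected:+.3f}")
```

Under these assumptions the pre-specified estimate averages close to zero, while the selected one averages clearly above zero even though no true effect exists. The gap is caused purely by which information the analyst conditioned on, which is the distinction the nested σ-algebras on slides 14 to 18 are meant to track.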
