# Data science as a science

This is a talk focusing on data science as a science given by Jeff Leek at JSM 2016

Published in: Science

### Data science as a science

1. 1. Evidence based data analysis @jtleek
2. 2. Data science as a Science (DSaaS) @jtleek
3. 3. “Data science is as much art as it is science.”
4. 4. “Wouldn’t it be amazing if we got 2,000 people to learn statistics!” -Jeff Leek 7/17/12
5. 5. date: 7/19/12 from: jtleek@gmail.com Roger let me know you gave him a ballpark figure for the number of students registered for his course “Computing for Data Analysis”. Could you give me an idea of how many have registered for my course “Data Analysis”?
6. 6. date: 7/19/12 from: pangwei@coursera.org Hi Jeff, 7,000 students! It's pretty awesome. (You'll be able to check this out yourself next week, once the class sites are up.)
7. 7. date: 7/19/12 from: rdpeng@gmail.com You are f**ed. -roger
8. 8. 9 classes 1 month long Always open
9. 9. Data Science Specialization Total Enrollments: 3,815,890 Total Completions: 409,712 Genomic Data Science Specialization Total Enrollments: 173,495 Total Completions: 10,826 Executive Data Science Specialization Total Enrollments: 62,076
10. 10. A theoretical model Data
11. 11. A theoretical model Data
12. 12. Y = some outcome X = some covariate D = (X,Y) lm(Y ~ X)
13. 13. Y = some outcome X = some covariate D = (X,Y) lm(Y ~ X) Leek and Peng, Nature 2015
14. 14. F0 ⊆ F0(S) ⊆ F0(Y) Fithian, Sun and Taylor arXiv 2015
15. 15. σ-algebra “what we know” F0 ⊆ F0(S) ⊆ F0(Y)
16. 16. “we’ve done nothing” F0 ⊆ F0(S) ⊆ F0(Y)
17. 17. “we did model selection” F0 ⊆ F0(S) ⊆ F0(Y)
18. 18. “we looked at all the data” F0 ⊆ F0(S) ⊆ F0(Y)
19. 19. E[β |F0] ≠ E[β |F0(S)]
20. 20. Population Question Hypothesis Experimental Design Experimenter Data Analysis Plan Analyst Code Estimate Claim Patil, Peng and Leek bioRxiv 2016
21. 21. Population Question Hypothesis Experimental Design Experimenter Data Analysis Plan Analyst Code Estimate Claim Patil, Peng and Leek bioRxiv 2016 F0 ⊆ F0(1P,Q(H)) ⊆ F0(1ED(E)) ⊆ F0(1ED;E(D)) ⊆ F0(1AP;A(C)) ⊆ F0(1C(A*))
22. 22. Population Question Hypothesis Experimental Design Experimenter Data Analysis Plan Analyst Code Estimate Claim Patil, Peng and Leek bioRxiv 2016 F0 ⊆ F0(1P,Q(H)) ⊆ F0(1ED(E)) ⊆ F0(1ED;E(D)) ⊆ F0(1AP;A(C)) ⊆ F0(1C(A*))
23. 23. A theoretical model Data
24. 24. Slide courtesy Hadley Wickham
25. 25. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham
26. 26. Who? What? When? Why? Where? How? Where Ingo is working
27. 27. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham Base R Lasso dplyr googlesheets ppt
28. 28. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham Bad life choices? Sparsity! David Robinson told me Spreadsheets Hegemony
29. 29. Cleveland and McGill JASA 1984
30. 30. Leek & Peng 2015 PNAS
31. 31. Experiment
32. 32. Leek and Peng, Science 2015
33. 33. Population Question Hypothesis Experimental Design Experimenter Data Analysis Plan Analyst Code Estimate Claim E[S | F0(1c(W))]
34. 34. We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked and whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population.
35. 35. 79% 17% Inferential vs Causal n=47,141
36. 36. We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked and whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population. We explain we think that the reason for this relationship is because cigarette smoke contains known carcinogens such as benzene, which make cells in lungs become cancerous.
37. 37. 65% 32% Inferential vs Causal n=47,141
38. 38. Experiment
39. 39. Population Question Hypothesis Experimental Design Experimenter Data Analysis Plan Analyst Code Estimate Claim E[Est | F0(1c(A))]
40. 40. 69% vs 40% n=1,985
41. 41. Experiment
42. 42. E[Claim | F0(1set(base)(A))] - E[Claim | F0(1set(ggplot2)(A))] Population Question Hypothesis Experimental Design Experimenter Data Analysis Plan Analyst Code Estimate Claim
43. 43. 1. Make a plot that answers the question: what is the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York? 2. Make a plot (possibly multi-panel) that answers the question: how does the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) vary by medical condition (DRG.Definition) and the state in which care was received (Provider.State)? Use only the [ggplot2/base R] graphics system (not base R or lattice) to make your figure.
44. 44. “Does the plot clearly show the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?” G: 5/22 (23%) vs. B: 5/12 (42%)
45. 45. “Does the plot clearly show how the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) varies by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?” G: 7/22 (32%) vs. B: 5/12 (42%)
46. 46. “Is the plot visually pleasing?” G: 21/22 (95%) vs. B: 10/12 (83%) G: 20/22 (91%) vs. B: 8/12 (67%)
47. 47. “Do the plot text and labels use full words instead of abbreviations?” G: 21/22 (95%) vs. B: 12/12 (100%) G: 11/22 (50%) vs. B: 5/12 (42%)
48. 48. A theoretical model Data
49. 49. Data science as a Science (DSaaS) @jtleek
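
Slide 19 states that the expected coefficient changes once we condition on a selection step: E[β | F0] ≠ E[β | F0(S)]. The sketch below is one way to see this numerically, under assumptions that are not from the talk: the data follow Y = βX + noise with standard normal X and noise, the "selection" is keeping only fits whose slope is significant at |t| > 1.96, and the function name `simulate_selection_effect` and all default values are hypothetical choices for illustration. It fits the simple regression from slide 12 (the analogue of `lm(Y ~ X)`) by closed-form least squares.

```python
import numpy as np


def simulate_selection_effect(beta=0.2, n=50, n_sims=5000, t_cut=1.96, seed=0):
    """Compare the average slope estimate with and without a selection step.

    All parameter values are illustrative assumptions, not taken from the slides.
    Returns (mean of all estimates, mean of selected estimates, fraction selected).
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_sims)
    selected = np.zeros(n_sims, dtype=bool)

    for i in range(n_sims):
        # Simulate one dataset D = (X, Y) from a simple linear model.
        x = rng.normal(size=n)
        y = beta * x + rng.normal(size=n)

        # Least-squares slope for a model with an intercept.
        xc = x - x.mean()
        yc = y - y.mean()
        sxx = xc @ xc
        b_hat = (xc @ yc) / sxx

        # Standard error of the slope from the residual variance.
        resid = yc - b_hat * xc
        se = np.sqrt((resid @ resid) / (n - 2) / sxx)

        estimates[i] = b_hat
        # Assumed selection rule: keep the fit only if the slope looks significant.
        selected[i] = abs(b_hat / se) > t_cut

    return (
        float(estimates.mean()),
        float(estimates[selected].mean()),
        float(selected.mean()),
    )


if __name__ == "__main__":
    all_mean, sel_mean, frac = simulate_selection_effect()
    print(f"mean estimate, no selection:    {all_mean:.3f}")
    print(f"mean estimate, after selection: {sel_mean:.3f}")
    print(f"fraction of fits selected:      {frac:.3f}")
```

Under these assumptions the unconditional average sits near the true β, while the average over selected fits is noticeably larger, which is the kind of gap the inequality on slide 19 describes.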