A talk on data science as a science, given by Jeff Leek at JSM 2016.

- 1. Evidence based data analysis @jtleek
- 2. Data science as a Science (DSaaS) @jtleek
- 3. “Data science is as much art as it is science.”
- 4. “Wouldn’t it be amazing if we got 2,000 people to learn statistics!” -Jeff Leek, 7/17/12
- 5. date: 7/19/12 from: jtleek@gmail.com Roger let me know you gave him a ballpark figure for the number of students registered for his course “Computing for Data Analysis”. Could you give me an idea of how many have registered for my course “Data Analysis”?
- 6. date: 7/19/12 from: pangwei@coursera.org Hi Jeff, 7,000 students! It's pretty awesome. (You'll be able to check this out yourself next week, once the class sites are up.)
- 7. date: 7/19/12 from: rdpeng@gmail.com You are f**ed. -roger
- 8. 9 classes; 1 month long; always open
- 9. Data Science Specialization: Total Enrollments 3,815,890; Total Completions 409,712. Genomic Data Science Specialization: Total Enrollments 173,495; Total Completions 10,826. Executive Data Science Specialization: Total Enrollments 62,076
- 10. A theoretical model Data
- 11. A theoretical model Data
- 12. Y = some outcome; X = some covariate; D = (X, Y); lm(Y ~ X)
- 13. Y = some outcome; X = some covariate; D = (X, Y); lm(Y ~ X). Leek and Peng, Nature 2015
- 14. F0 ⊆ F0(S) ⊆ F0(Y). Fithian, Sun and Taylor, arXiv 2015
- 15. σ-algebra: “what we know”. F0 ⊆ F0(S) ⊆ F0(Y)
- 16. “we’ve done nothing”. F0 ⊆ F0(S) ⊆ F0(Y)
- 17. “we did model selection”. F0 ⊆ F0(S) ⊆ F0(Y)
- 18. “we looked at all the data”. F0 ⊆ F0(S) ⊆ F0(Y)
- 19. E[β | F0] ≠ E[β | F0(S)]
- 20. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. Patil, Peng and Leek, bioRxiv 2016
- 21. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. Patil, Peng and Leek, bioRxiv 2016. F0 ⊆ F0(1_{P,Q}(H)) ⊆ F0(1_{ED}(E)) ⊆ F0(1_{ED;E}(D)) ⊆ F0(1_{AP;A}(C)) ⊆ F0(1_{C}(A*))
- 22. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. Patil, Peng and Leek, bioRxiv 2016. F0 ⊆ F0(1_{P,Q}(H)) ⊆ F0(1_{ED}(E)) ⊆ F0(1_{ED;E}(D)) ⊆ F0(1_{AP;A}(C)) ⊆ F0(1_{C}(A*))
- 23. A theoretical model Data
- 24. Slide courtesy Hadley Wickham
- 25. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham
- 26. Who? What? When? Why? Where? How? Where Ingo is working
- 27. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham. Base R; Lasso; dplyr; googlesheets; ppt
- 28. Who? What? When? Why? Where? How? Slide courtesy Hadley Wickham. Bad life choices?; Sparsity!; David Robinson told me; Spreadsheets; Hegemony
- 29. Cleveland and McGill JASA 1984
- 30. Leek & Peng 2015 PNAS
- 31. Experiment
- 32. Leek and Peng, Science 2015
- 33. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. E[S | F0(1_{c}(W))]
- 34. We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked or whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population.
- 35. 79% vs 17%. Inferential vs Causal. n=47,141
- 36. We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked or whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population. We explain we think that the reason for this relationship is because cigarette smoke contains known carcinogens such as benzene, which make cells in lungs become cancerous.
- 37. 65% vs 32%. Inferential vs Causal. n=47,141
- 38. Experiment
- 39. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim. E[Est | F0(1_{c}(A))]
- 40. 69% vs 40%, n=1,985
- 41. Experiment
- 42. E[Claim | F0(1_{set(base)}(A))] - E[Claim | F0(1_{set(ggplot2)}(A))]. Population, Question, Hypothesis, Experimental Design, Experimenter, Data, Analysis Plan, Analyst, Code, Estimate, Claim
- 43. 1. Make a plot that answers the question: what is the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York? 2. Make a plot (possibly multi-panel) that answers the question: how does the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) vary by medical condition (DRG.Definition) and the state in which care was received (Provider.State)? Use only the [ggplot2/base R] graphics system (not base R or lattice) to make your figure.
- 44. “Does the plot clearly show the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?” G: 5/22 (23%) vs. B: 5/12 (42%)
- 45. “Does the plot clearly show how the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) varies by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?” G: 7/22 (32%) vs. B: 5/12 (42%)
- 46. “Is the plot visually pleasing?” G: 21/22 (95%) vs. B: 10/12 (83%); G: 20/22 (91%) vs. B: 8/12 (67%)
- 47. “Do the plot text and labels use full words instead of abbreviations?” G: 21/22 (95%) vs. B: 12/12 (100%); G: 11/22 (50%) vs. B: 5/12 (42%)
- 48. A theoretical model Data
- 49. Data science as a Science (DSaaS) @jtleek
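
Slide 19's statement, E[β | F0] ≠ E[β | F0(S)], says that an estimate behaves differently once we condition on having done model selection. The following is a minimal sketch of that idea and is not taken from the talk: the simulation design, the function names `slope` and `simulate`, and all parameter values are assumptions made for illustration. Every true slope is zero, so signed averages are near zero either way by symmetry; the sketch therefore compares the average magnitude of a slope fixed in advance with that of the slope picked after looking at all candidates.

```python
import random
import statistics


def slope(x, y):
    """Least-squares slope of y on a single covariate x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def simulate(n_reps=500, n_obs=50, n_covariates=10, seed=1):
    """Hypothetical setup: every covariate is pure noise, so each true slope is 0."""
    rng = random.Random(seed)
    prespecified, selected = [], []
    for _ in range(n_reps):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        slopes = []
        for _ in range(n_covariates):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]
            slopes.append(slope(x, y))
        # "We've done nothing": the covariate was fixed before seeing data.
        prespecified.append(slopes[0])
        # "We did model selection": keep the covariate with the largest |slope|.
        selected.append(max(slopes, key=abs))
    return prespecified, selected


pre, sel = simulate()
mean_abs_pre = statistics.fmean(abs(b) for b in pre)
mean_abs_sel = statistics.fmean(abs(b) for b in sel)
print(f"mean |slope|, prespecified covariate: {mean_abs_pre:.3f}")
print(f"mean |slope|, selected covariate:     {mean_abs_sel:.3f}")
```

The selected slope is larger in magnitude on average even though nothing is truly related to the outcome, which is one concrete reason an analysis should account for the selection step that produced the reported estimate.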
