
Fixing the leaks in the pipeline from public genomics data to the clinic


A talk about improving reproducibility, simplifying genomic machine learning, and using the resulting predictors to improve power in clinical trials.

Published in: Health & Medicine

  1. fixing the leaks in the genomics
  2. http://jhudatascience.org/
  3. https://www.coursera.org/specialization/genomics/41
  4. @simplystats http://simplystatistics.org
  5. @jtleek http://www.jtleek.com
  6. https://www.counsyl.com/
  7. Their basic pitch was “Genomics is a fraud” http://www.technologyreview.com/news/535771/a-contrarian-in-biotech/
  8. “The explosive growth of next-generation sequencing data submitted into the SRA exceeds the growth rate of storage capacity” http://www.ncbi.nlm.nih.gov/pubmed/22009675
  9. 3: cost, analyst variation, motivation
  10. 1: cost
  11. costs: money, interpretability
  12. http://arxiv.org/pdf/math/0606441.pdf
  13. http://www.ncbi.nlm.nih.gov/pubmed/19276151
  14. @leekgroup
  15. http://www.ncbi.nlm.nih.gov/pubmed/25788628
  16. http://www.ncbi.nlm.nih.gov/pubmed/25788628
  17. [Figure: accuracy (0% to 100%) of PAM scaled, PAM unscaled, and TSP for Agilent/Grade 1, Agilent/Grade 3, Illumina/Grade 1, and Illumina/Grade 3] http://www.ncbi.nlm.nih.gov/pubmed/25788628
  18. algorithm: 1. select useful pairs 2. screen pairs for association 3. build a simple CART predictor
  19. http://www.ncbi.nlm.nih.gov/pubmed/19276151
  20. Patil et al. (in prep)
  21. Patil et al. (in prep)
  22. Patil et al. (in prep)
  23. @leekgroup Data: $x_{ik}$ = value for feature $i$, sample $k$; $y_k$ = group indicator for sample $k$. TSP is the $(i,j)$ pair that maximizes $|\widehat{\Pr}(x_{ik} < x_{jk} \mid y_k = 1) - \widehat{\Pr}(x_{ik} < x_{jk} \mid y_k = 0)|$ http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1989150/
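As a rough illustration of the score defined on slide 23, the sketch below computes the empirical version of |Pr(x_i < x_j | y = 1) - Pr(x_i < x_j | y = 0)| for every feature pair and returns the maximizer. The feature names `g1` to `g3`, the toy values, and the helper names `tsp_score` and `top_scoring_pair` are made up for the example. It assumes both groups are non-empty and is not the authors' implementation.

```python
from itertools import combinations


def tsp_score(x_i, x_j, y):
    """Empirical |Pr(x_i < x_j | y = 1) - Pr(x_i < x_j | y = 0)| for one pair."""
    # z_k = 1 when feature i is below feature j in sample k
    z = [1 if a < b else 0 for a, b in zip(x_i, x_j)]
    group1 = [zk for zk, yk in zip(z, y) if yk == 1]
    group0 = [zk for zk, yk in zip(z, y) if yk == 0]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


def top_scoring_pair(features, y):
    """Return the feature-name pair with the largest score (first one on ties)."""
    pairs = combinations(sorted(features), 2)
    return max(pairs, key=lambda p: tsp_score(features[p[0]], features[p[1]], y))


# Hypothetical data: three features measured on six samples, three per group.
features = {
    "g1": [1, 2, 1, 5, 6, 4],
    "g2": [3, 4, 2, 1, 2, 3],
    "g3": [2, 3, 4, 3, 3, 5],
}
y = [1, 1, 1, 0, 0, 0]

best_pair = top_scoring_pair(features, y)
best_score = tsp_score(features[best_pair[0]], features[best_pair[1]], y)
```

In this toy data `g1 < g2` holds in every sample of group 1 and in no sample of group 0, so that pair reaches the maximum possible score of 1. Because only the ordering within a sample matters, the score is unchanged by any monotone rescaling of a sample's values.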
  24. @leekgroup $z_{ijk} = 1(x_{ik} < x_{jk})$; $E[z_{ijk} \mid y_k] = a_{0ij} + a_{1ij} y_k$ → $\max |a_{1ij}|$ = TSP. Patil et al. (in prep)
  25. @leekgroup • Not the same as TSP • But $|\hat{a}/\mathrm{s.e.}(\hat{a})| = |\hat{u}/\mathrm{s.e.}(\hat{u})|$, algebraically • “Variance regularized” TSP • $z_{ijk}$ invariant to monotone transformations • Fix parameters → find features. $E[y_k \mid z_{ijk}] = u_{0ij} + u_{1ij} z_{ijk}$. Patil et al. (in prep)
  26. @leekgroup 1. Calculate t-statistic for all pairs 2. Choose top pair (or covariate) 3. Continue for a fixed number of pairs. $E[y_k \mid z_{ijk}] = u_{0ij} + u_{1ij} z_{ijk}$. Patil et al. (in prep)
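The algebraic identity quoted on slide 25 can be checked numerically. The sketch below fits both simple linear regressions by ordinary least squares on a made-up indicator vector and compares the slope t-statistics. The helper `slope_and_t` and the toy data are assumptions for illustration; the code assumes neither variable is constant and that the fit is not perfect (otherwise the standard error is zero).

```python
from math import sqrt


def slope_and_t(x, y):
    """OLS slope and its t-statistic for the regression E[y | x] = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the usual standard error of the slope
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    std_err = sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err


# Hypothetical data: z_k = 1(x_ik < x_jk) for one pair, y_k = group label.
z = [1, 1, 1, 1, 1, 0, 1, 0]
y = [1, 1, 1, 1, 0, 0, 0, 0]

a1, t_a = slope_and_t(y, z)  # E[z | y] = a0 + a1 * y
u1, t_u = slope_and_t(z, y)  # E[y | z] = u0 + u1 * z
```

Here the two slopes differ (`a1` is the difference in group proportions, which is the TSP score for this pair up to sign, while `u1` is rescaled by the variance of `z`), yet the two t-statistics agree. Ranking pairs by the t-statistic therefore penalizes pairs whose indicator is nearly constant, which is one reading of "variance regularized".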
  27. @leekgroup http://astor.som.jhmi.edu/~marchion//breastTSP.html
  28. @leekgroup [Figure: decision tree splitting on the pairs USP7 < RP11-423C15.3, NM_018610 < MTCH1, and RND1 < LGALS14, with Yes/No branches ending in Recur / No Recur leaves]
  29. @leekgroup
  30. @leekgroup Mammaprint Patil et al. (in prep)
  31. 2: analyst variation
  32. what went wrong? 2 things
  33. what went wrong? transparency: The data/code weren’t reproducible
  34. what went wrong? transparency: There was a lack of cooperation
  35. what went wrong? expertise: They used silly prediction rules (Pr(FEC) = 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
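One way to see why the rule on slide 35 is hard to read as a probability is to evaluate it at the extremes of its inputs. The short sketch below does that arithmetic with exact fractions; the function name `combined_rule` is made up, and the inputs are simply the smallest and largest values three probabilities can take.

```python
from fractions import Fraction


def combined_rule(p_f, p_e, p_c):
    """The rule quoted on the slide: 5/8 * [Pr(F) + Pr(E) + Pr(C)] - 1/4."""
    return Fraction(5, 8) * (p_f + p_e + p_c) - Fraction(1, 4)


# All three single-drug probabilities at 0, then all three at 1.
low = combined_rule(Fraction(0), Fraction(0), Fraction(0))
high = combined_rule(Fraction(1), Fraction(1), Fraction(1))

# The output stays inside [0, 1] only while the three inputs sum to
# a value between 2/5 and 2.
at_lower_edge = combined_rule(Fraction(2, 5), Fraction(0), Fraction(0))
at_upper_edge = combined_rule(Fraction(1), Fraction(1), Fraction(0))
```

The rule returns -1/4 when all inputs are 0 and 13/8 when all inputs are 1, so it can produce values that are not valid probabilities.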
  36. what went wrong? expertise: They had study design problems (batch effects)
  37. what went wrong? expertise: Their predictions weren’t locked down (Today: Pr(FEC) = 0.8; Tomorrow: Pr(FEC) = 0.1)
  38. At the end of the day the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
  39. @leekgroup http://bit.ly/10vS1yt
  40. @leekgroup http://bit.ly/OgW3xv
  41. @leekgroup Drinkel et al. Organometallics 2013
  42. @leekgroup
  43. @leekgroup
  44. @leekgroup
  45. @leekgroup
  46. http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court
  47. 3: motivation
  48. $ (from reducing sample size)
  49. basic idea: randomization isn’t perfect; “rebalance” with baseline covariates; improve estimator precision
  50. Ack Math!!!!
  51. (1) Estimate the probability of being in each arm given baseline covariates
  52. (2) Calculate an initial estimate for each person from each arm's model, fit by propensity-score-weighted logistic regression
  53. (3) Define a covariate as the residual from fitting the arm-level models minus the arm-level means, and fit new propensity models
  54. (4) Use these propensities to re-fit the weighted logistic regression (WLR) from (2), then average predictions to get the covariate-adjusted treatment effect
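The "rebalance" idea from slide 49 can be illustrated with a much simpler estimator than the one outlined in steps (1) to (4). The sketch below is not the propensity-weighted logistic regression procedure from the slides: it swaps in a plain linear (ANCOVA-style) adjustment with a continuous outcome and one baseline covariate. The function names and the toy trial are assumptions made for the example, and the data are noise-free by construction so the arithmetic is easy to follow.

```python
def mean(values):
    return sum(values) / len(values)


def unadjusted_effect(y_treat, y_ctrl):
    """Plain difference in mean outcome between arms."""
    return mean(y_treat) - mean(y_ctrl)


def adjusted_effect(y_treat, w_treat, y_ctrl, w_ctrl):
    """Difference in means corrected for chance imbalance in one covariate w."""
    sxy = sxx = 0.0
    # Pooled within-arm slope of outcome on the baseline covariate
    for ys, ws in ((y_treat, w_treat), (y_ctrl, w_ctrl)):
        m_y, m_w = mean(ys), mean(ws)
        sxy += sum((w - m_w) * (y - m_y) for w, y in zip(ws, ys))
        sxx += sum((w - m_w) ** 2 for w in ws)
    beta = sxy / sxx
    imbalance = mean(w_treat) - mean(w_ctrl)
    return unadjusted_effect(y_treat, y_ctrl) - beta * imbalance


# Hypothetical trial: outcome = 2 * arm + 3 * w, so the true effect is 2.
# By chance the treated arm has a higher average baseline covariate.
w_treat = [1, 2, 3, 6]
w_ctrl = [0, 1, 2, 5]
y_treat = [2 + 3 * w for w in w_treat]
y_ctrl = [3 * w for w in w_ctrl]

naive = unadjusted_effect(y_treat, y_ctrl)
adjusted = adjusted_effect(y_treat, w_treat, y_ctrl, w_ctrl)
```

In this toy trial the unadjusted difference absorbs the covariate imbalance between arms, while the adjusted estimate recovers the effect built into the data. With noisy data the same correction reduces the variance of the estimate, which is where the potential saving in sample size comes from.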
  55. @leekgroup http://astor.som.jhmi.edu/~marchion//breastTSP.html
  56. @leekgroup Age, Tumor Size, Grade: 5.1%; Age, Tumor Size, Grade, ER Status: 4.9%; Mammaprint Risk Category (MRC): 5.4%; Age, Tumor Size, Grade, ER Status, MRC: 7.8%
  57. @leekgroup Age, Tumor Size, Grade: 5.1%; Age, Tumor Size, Grade, ER Status: 4.9%; Mammaprint Risk Category (MRC): 5.4%; Age, Tumor Size, Grade, ER Status, MRC: 7.8%; Age, Tumor Size, Grade, ER Status, TSP: 6.2%
  58. 3: cost, analyst variation, motivation
  59. acknowledgements. Leek group: Prasad Patil, Leo Collado Torres, Abhi Nellore, Claire Ruberman, Jack Fu, Kai Kammers. Collaborators: Michael Rosenblum, Benjamin Haibe-Kains, P.O. Bachant-Winner, Roger Peng
  60. Prasad Patil http://www.biostat.jhsph.edu/~prpatil/
  61. Links: https://github.com/leekgroup/sig2trial http://jtleek.com/talks/
