
Big biomedical data is a lie

Presented at Pharma Advanced Analytics, London 2018/1/31


  1. Big Biomedical Data is a Lie: Taming large datasets for translational research • Paul Agapow, Data Science Institute, Imperial College London <p.agapow@imperial.ac.uk> • 2018/1/31
  2. Disclosure / About me • Data Science Institute (Imperial College London) • Big, rich biomedical datasets for translational research & precision medicine • Novel & advanced computation for research • No actual or potential conflict of interest in relation to this presentation
  3. “Nice training set. Where’s your data?” – An analyst
  4. Biomedical big data is often not big enough • Average trial size on ClinicalTrials.gov < 100 • Average #samples per GEO dataset < 100 • Average GWAS cohort size ~9000 (median ~2500) • 1,064 ICU admissions for flu in UK 2016/2017 season • Curse of dimensionality • Deep learning requires “thousands” of samples for training (at least p²?) • GWAS needs 3K+ samples for large effects, 10K or more for small effects … • Sub-populations will be smaller
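To make the sample-size point concrete, here is a minimal scikit-learn sketch (mine, not from the talk): with a GEO-scale cohort (n ≈ 100) and transcriptome-scale features (p ≈ 20,000), a flexible model separates pure noise perfectly in training, yet generalises at chance.

```python
# Minimal sketch: n = 100 samples vs p = 20,000 features of pure noise.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n, p = 100, 20_000
X = rng.normal(size=(n, p))        # noise "expression" matrix
y = rng.integers(0, 2, size=n)     # random case/control labels

clf = LogisticRegression(max_iter=1000)
print("train accuracy:", clf.fit(X, y).score(X, y))                    # ~1.0: memorised
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())  # ~0.5: chance
```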
  5. Platforms are a problem, not a panacea • Biomedical data lakes / warehouses aren’t working • Each is an island unto itself • Tools can’t understand data formats • High demands on user (meaning, context) • Poor standardisation / harmonisation tools (curation effort == analysis effort) • A world of distributed data • A world of many computational idioms • (Self) lock-in
  6. Computers are not getting faster • Data is embiggening • Can’t rely on cheap computation to get us out of a hole • Many HPC idioms, most awkward (e.g. Map-Reduce) • DB schemas struggle at scale
  7. What if every gene affects every other gene? • Pritchard’s omnigenics (2017) • Kevin Bacon effect • Implicated genes are a few drivers and an enormous number of “related” loci • How do we pick the “important” genes?
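If omnigenics holds, marginal effect sizes alone won’t separate drivers from the crowd of “related” loci. One possible heuristic (an illustration of mine, not Pritchard’s method) is to rank genes by centrality in an interaction network; the gene names below are real asthma loci, but the edges are invented for the toy example.

```python
# Toy sketch: rank genes by network centrality instead of marginal effect.
# Gene names are real asthma loci; the edges are made up for illustration.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("IL13", "IL4R"), ("IL4R", "STAT6"), ("STAT6", "GATA3"),
    ("GATA3", "IL13"), ("ORMDL3", "GSDMB"), ("GSDMB", "IL13"),
])
for gene, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(f"{gene}\t{score:.3f}")   # IL13 sits at the hub of this toy network
```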
  8. Statisticians hate us • P-hacking • Garden of forking paths • Regression to the mean • Multiple hypothesis testing • False discovery • P-values • Which method is best?
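A worked illustration of the multiple-testing bullet (generic SciPy/statsmodels, not from the talk): run 10,000 tests on pure noise and a raw p < 0.05 threshold “discovers” roughly 500 hits, all false; a standard correction such as Benjamini-Hochberg removes nearly all of them.

```python
# 10,000 t-tests on noise: raw thresholding vs Benjamini-Hochberg FDR control.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(10_000)
])
print("raw p < 0.05:", (pvals < 0.05).sum())       # ~500 hits, all spurious
reject, *_ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("BH-corrected:", reject.sum())               # ~0 hits
```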
  9. In summary • Data isn’t big (enough) • Platforms are a problem • Computation isn’t saving us • Diseases are complicated • We don’t know what we’re doing
  10. Solutions → Responses
  11. Allow bigger datasets • “Allow” reuse & combining, not “build” • Assemble datasets according to standards (CDISC, EDAM, HPO) • Poor tools, but getting better: tmtk / Arborist, eHS • Issue of trust • [Poster: “From Excel to tranSMART in five simple steps” via the tmtk Python library (import, validate, edit, save, load) and the Arborist visual editor for collaborative data modelling with non-technical experts: drag-and-drop restructuring of the tranSMART tree, renaming variables and values, metadata editing, low- and high-dimensional data. Demo: http://arborist-test-trait.thehyve.net/demo; code: https://github.com/thehyve/arborist (GPL v3).]
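The poster’s flow, sketched with the tmtk calls as I recall them from The Hyve’s documentation of the period; treat the exact names and paths as assumptions and check the current tmtk docs.

```python
import tmtk

# Load a study definition (a tranSMART .params file; path is illustrative).
study = tmtk.Study('./my_study/study.params')

# Validate: let the toolkit check tranSMART-specific requirements.
study.validate_all()

# Edit: send the concept tree to the Arborist web editor so non-technical
# collaborators can restructure, rename and annotate it.
study.call_boris()
```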
  12. eTRIKS project • Via IMI: Europe’s largest public-private initiative • Data-intensive translational research • Sharing data (standards, starter kit) • Open knowledge platform • Sustainable service
  13. Example: U-BIOPRED • Unbiased BIOmarkers in PREDiction of respiratory disease outcomes • 900+ patients, 16 clinical centres + other studies, combined via standards • Outputs: • Common tranSMART db • 40+ academic publications • Subtyping of asthmatics
  14. Use your data better • Pre-training (data without labels) • Initial training with mediocre data • Adapt • Transfer learning (labels / output changes) • Domain adaptation (data / input changes) • Don’t use deep learning
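As a concrete (and generic) recipe for the “adapt” bullets, here is a minimal PyTorch transfer-learning sketch, not the speaker’s code: reuse a backbone pre-trained on a large dataset and retrain only a new output head on the small cohort.

```python
import torch
from torchvision import models

# Transfer learning: frozen pre-trained backbone, new trainable head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                          # keep transferred features
model.fc = torch.nn.Linear(model.fc.in_features, 2)      # new task: e.g. case vs control
optimiser = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # update the head only
```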
  15. Example: text extraction • Aim: extract biological relationships from publications to build an asthma knowledge base • Using BEL statements • Domain expert time is prohibitive • Use previous efforts as training data
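For readers unfamiliar with BEL: a statement is a subject-relation-object triple over namespaced biological entities. The example below is illustrative, not taken from the project’s knowledge base.

```python
# An illustrative BEL statement as a string: protein abundance of IL-13
# increases protein abundance of the mucin MUC5AC (an asthma-relevant claim).
statement = 'p(HGNC:IL13) increases p(HGNC:MUC5AC)'
```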
  16. Example: text classification for systematic reviews • Aim: find similar or related publications within a corpus • Actual aim: find which method of text classification is “best” (validation) • Data: 15 Drug Control Reviews & Neuropathic Pain dataset • Classify with random forest, naive Bayes, SVM & CNNs • Which has the best recall?
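A sketch of how such a comparison is commonly set up in scikit-learn (generic, not the study’s actual code; `load_review_corpus` is a hypothetical loader standing in for the review datasets, and the CNN arm is omitted):

```python
# Compare classifiers on TF-IDF features, scoring recall (missed papers are
# the costly error in systematic reviews). load_review_corpus is hypothetical.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

docs, labels = load_review_corpus()        # hypothetical: texts + 0/1 relevance
classifiers = {
    "SVM": LinearSVC(),
    "MNB": MultinomialNB(),
    "RF":  RandomForestClassifier(n_estimators=200),
}
for name, clf in classifiers.items():
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    recall = cross_val_score(pipe, docs, labels, cv=5, scoring="recall").mean()
    print(f"{name}: mean recall = {recall:.2f}")
```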
  17. When you don’t know what to use, use SVMs • Conclusion: best-performing classifier per dataset (WSS = work saved over sampling):

     Dataset                   | WSS  | Classifier
     ACE Inhibitors            | 0.26 | SVM
     ADHD                      | 0.35 | MNB
     Antihistamines            | 0.19 | MNB
     Atypical Antipsychotics   | 0.12 | SVM
     Beta Blockers             | 0.13 | SVM
     CCB                       | 0.21 | SVM
     Estrogen                  | 0.25 | SVM
     Neuropathic Pain          | 0.61 | CNN
     NSAIDS                    | 0.14 | SVM
     Opioids                   | 0.23 | SVM
     Oral Hypoglycemics        | 0.21 | SVM
     PPI                       | 0.17 | SVM
     Skeletal Muscle Relaxants | 0.21 | SVM
     Statins                   | 0.19 | SVM
     Triptans                  | 0.22 | SVM
     Urinary Incontinence      | 0.25 | SVM
  18. Not platforms but meta-platforms • The monolithic platform is dead • We live in a world of distributed data • Avoid lock-in • Don’t try to do everything • Interoperability • Allow different computational idioms
  19. tranSMART redevelopment • eTRIKS enhancements • i2b2 merger • Next-generation tranSMART • Major refactoring & performance fixes • Additional tools & visualisation • Component architecture • Just a warehouse with an API
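“Just a warehouse with an API” means any client can pull data over REST instead of living inside the platform. Here is a sketch modelled on the tranSMART v2 REST API; the host, token, and response field names are assumptions to check against your server’s documentation.

```python
import requests

# Hypothetical host and token; endpoint modelled on the tranSMART v2 REST API.
BASE = "https://transmart.example.org/v2"
headers = {"Authorization": "Bearer <token>"}

resp = requests.get(f"{BASE}/studies", headers=headers, timeout=30)
resp.raise_for_status()
for study in resp.json().get("studies", []):
    print(study["studyId"])    # field name as I recall it from the v2 API docs
```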
  20. Better HPC idioms • Spark • Like Map-Reduce, but intermediate results stay in memory rather than being persisted back to disk between steps • Better for iterative processing • Does less violence to the problem • Graphs & ML
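The key difference in one sketch (generic PySpark, not from the talk): cache the working set once, and every later step reuses the in-memory copy, where classic Map-Reduce would write to and re-read from disk between passes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
# Illustrative input path; .cache() pins the parsed RDD in cluster memory.
lines = spark.sparkContext.textFile("expression_matrix.csv").cache()

print(lines.count())   # first action reads the file and populates the cache
print(lines.first())   # later actions reuse the in-memory copy, no re-read
spark.stop()
```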
  21. Example: Spark for clustering • Subtyping / stratification • Popular methods are computationally prohibitive on rich data • (Also, ground truth is unclear) • “Sparkify”, compare, validate on an asthma cohort
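A minimal “Sparkified” clustering sketch using MLlib’s distributed k-means (generic, not the project’s code; the input path and k are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("subtype-demo").getOrCreate()
df = spark.read.csv("cohort_features.csv", header=True, inferSchema=True)  # illustrative

# Distributed k-means: the per-iteration distance computations are spread
# across the cluster, which is what makes rich cohorts tractable.
vec = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)
model = KMeans(k=4, seed=1, featuresCol="features").fit(vec)
model.transform(vec).groupBy("prediction").count().show()   # subtype sizes
spark.stop()
```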
  22. Hypothesis generation vs validation • Generating leads vs. testing them • Machine learning for: • hypothesis generation / exploration • streamlining of laborious manual tasks • Validate!
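The discipline in one sketch (generic scikit-learn; `X`, `y`, and `model` are hypothetical placeholders): generate hypotheses on one split, and only believe the score on data the model never saw.

```python
from sklearn.model_selection import train_test_split

# X, y, model are hypothetical placeholders for your data and estimator.
X_explore, X_validate, y_explore, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model.fit(X_explore, y_explore)               # hypothesis generation / exploration
print(model.score(X_validate, y_validate))    # validate on unseen data
```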
  23. Conclusions • Big biomedical data is often not big, but we can make it bigger • We don't need more platforms, we need platforms that work together • Sometimes Big Data approaches are useful, sometimes not: choose wisely • Trust but verify (especially machine learning)
  24. Thanks • Data Science Institute, ICL • Fayzal Ghantiwala (Bloomberg) • Nazanin Zounemat Kermani (ICL) • Mansoor Saqi (EISBM / ICL) • Jose Saray (EISBM) • eTRIKS consortium • U-BIOPRED consortium
