
Analysing biomedical data (ERS October 2017)

  1. Analysing biomedical data
     Paul Agapow / Translational Bioinformatics DSI-ICL / October 2017
  2. Biomedical science is now data science
  3. The 4-headed beast
     ● The 4 heads
       ○ Acquisition
       ○ Storage
       ○ Analysis
       ○ Sharing
     ● Big data 4Vs
       ○ Velocity
       ○ Volume
       ○ Variety
       ○ Veracity
  4. The problems of biomedical data
     Many ...
     ● Types
     ● Formats
     ● Silos
     ● Gaps
     ● Interactions
     Difficult analysis
     ● The curse of dimensionality
     ● Multiple hypothesis testing & false discovery
     ● Batch effects
     ● Life history
     ● Biased sampling
     ● Need for integrative analysis
     Practical issues
     ● Unstructured data
     ● Managing big data
     ● Security
     ● Legal & privacy
  5. Future medicine
     A mix of promise & peril
     ● More data
       ○ Genomic medicine
       ○ Other “omic” medicine
       ○ Wearables
       ○ EHR & digital health
     ● P4 medicine
       ○ Stratification
       ○ Analysis at the bedside
       ○ Patient participation
     ● Translational medicine
       ○ Leveraging health data for research
  6. Scientific data doubles every 18 months
     A new paper is published every 30 seconds
     Most papers are never cited or even read
     “No new principle will declare itself from below a heap of facts.” (Peter Medawar)
  7. The analytical challenges
  8. Liberating health data
     ● Enabling EHR for research
     ● Text extraction
     ● Unstructured to structured data (see the sketch below)
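A minimal sketch of the unstructured-to-structured step, assuming a free-text clinical note and a couple of hand-written regular expressions; the note text, field names and patterns are invented for illustration, not a production EHR extraction pipeline:

```python
import re

# Invented clinical note for illustration.
note = "Patient reports headaches. BP 142/91 mmHg on 2017-03-02. Started ramipril 5 mg."

# Hypothetical patterns for two structured fields.
BP_PATTERN = re.compile(r"\bBP\s+(\d{2,3})/(\d{2,3})\b")
DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def extract_fields(text):
    """Pull a few structured values out of free text and return them as a dict."""
    record = {}
    bp = BP_PATTERN.search(text)
    if bp:
        record["systolic_bp"] = int(bp.group(1))
        record["diastolic_bp"] = int(bp.group(2))
    date = DATE_PATTERN.search(text)
    if date:
        record["date"] = date.group(1)
    return record

print(extract_fields(note))
# {'systolic_bp': 142, 'diastolic_bp': 91, 'date': '2017-03-02'}
```

Real clinical text extraction has to handle negation, abbreviations and context (typically with a clinical NLP toolkit), but the output shape is the same: structured fields per note.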
  9. Computationally intensive approaches
     ● Deep learning
     ● Concurrent computation (see the sketch below)
     ● Which to use? Which works?
     ● Implementation
     ● Interpretation
     ● Assisted & auto-discovery
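One way to read the concurrent-computation point: a sketch that fans an embarrassingly parallel per-sample analysis out across CPU cores with the standard library; `score_sample` is a placeholder for any expensive step, not a real analysis:

```python
from concurrent.futures import ProcessPoolExecutor

def score_sample(sample_id):
    """Placeholder for an expensive per-sample analysis step."""
    # ... real work (alignment, feature extraction, model scoring) would go here ...
    return sample_id, sum(ord(c) for c in sample_id)  # dummy result

if __name__ == "__main__":
    sample_ids = [f"S{i:03d}" for i in range(1, 9)]
    # Run the samples in parallel across the available CPU cores.
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(score_sample, sample_ids))
    print(results)
```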
  10. Integrative analysis
      ● The genome is not enough (see the sketch below)
      ● Complex interactions
      ● Statistical power
      ● Which is best?
      ● Interpretation
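A minimal sketch of the mechanical part of integrative analysis, assuming two hypothetical per-sample tables (expression and clinical) that share a sample identifier; the column names and values are invented:

```python
import pandas as pd

# Hypothetical per-sample tables from two different data types.
expression = pd.DataFrame({
    "sample_id": ["S001", "S002", "S003"],
    "IL6_expr": [2.3, 5.1, 1.8],
    "TNF_expr": [0.9, 3.2, 1.1],
})
clinical = pd.DataFrame({
    "sample_id": ["S001", "S002", "S003"],
    "age": [54, 61, 47],
    "fev1_pct": [78, 52, 91],
})

# An inner join on the shared identifier gives one feature matrix per sample,
# ready for whatever joint model is chosen afterwards.
integrated = expression.merge(clinical, on="sample_id", how="inner")
print(integrated)
```

The hard questions on the slide (power, which method, interpretation) start after this step; the join itself is the easy part.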
  11. Building knowledge bases
      ● Extracting structured information from unstructured input (sketch below)
      ● Veracity
      ● Exploring / querying
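A toy sketch of a knowledge base as subject-predicate-object triples with a pattern query; the triples are invented examples, and a real system would also attach provenance to each statement to address the veracity point:

```python
# Tiny in-memory triple store (illustrative only).
triples = [
    ("TNF", "associated_with", "rheumatoid arthritis"),
    ("infliximab", "targets", "TNF"),
    ("rheumatoid arthritis", "is_a", "autoimmune disease"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the (possibly partial) pattern."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# Everything we "know" that mentions TNF.
print(query(subject="TNF") + query(obj="TNF"))
```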
  12. Reproducibility
  13. The “solutions”
  14. Standards
      ● Clinical descriptions
      ● Measurements:
        ○ Blood pressure
        ○ White cells (see the sketch below)
      ● Cross-study
      Yes!
      ● Allows combining & comparing studies
      ● CDISC
      ● HPO
      But!
      ● A lot of work
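A small sketch of why measurement standards matter when combining studies: the same white cell count reported under two local conventions has to be converted to one unit before comparison. The record layout is invented; the conversion itself (1000 cells/µL = 1 × 10⁹/L) is standard:

```python
# Two studies report white cell counts under different local conventions.
study_a = [{"sample": "A1", "wbc": 7300, "unit": "cells/uL"}]
study_b = [{"sample": "B1", "wbc": 6.9, "unit": "10^9/L"}]

def to_standard_wbc(record):
    """Convert a white cell count to a common unit (10^9 cells per litre)."""
    if record["unit"] == "10^9/L":
        value = record["wbc"]
    elif record["unit"] == "cells/uL":
        value = record["wbc"] / 1000.0   # 1000 cells/uL == 1 x 10^9/L
    else:
        raise ValueError(f"unknown unit: {record['unit']}")
    return {"sample": record["sample"], "wbc_10e9_per_L": value}

combined = [to_standard_wbc(r) for r in study_a + study_b]
print(combined)   # values are now directly comparable across studies
```

Standards such as CDISC and the HPO do this at the level of whole study datasets and phenotype descriptions, which is exactly why adopting them is "a lot of work".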
  15. Data formats & storage
      Yes!
      ● Plain text
      ● Open formats
      ● Structured formats
      ● Advantages:
        ○ Human & machine readable
        ○ Unambiguous
        ○ WYSIWYG
      ● Examples:
        ○ Open bio formats
        ○ CSV, TSV (see the sketch below)
      No!
      ● Homebrew formats
      ● Proprietary / closed formats
      ● Binary formats
      ● Excel
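A minimal illustration of the plain-text, open, structured point: writing and re-reading a small TSV with only the standard library, so the file stays human and machine readable; the filename and columns are invented:

```python
import csv

rows = [
    {"sample_id": "S001", "group": "case", "crp_mg_L": "4.2"},
    {"sample_id": "S002", "group": "control", "crp_mg_L": "1.1"},
]

# Write tab-separated plain text: readable in any editor, from any language.
with open("measurements.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["sample_id", "group", "crp_mg_L"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# Read it back.
with open("measurements.tsv", newline="") as fh:
    for record in csv.DictReader(fh, delimiter="\t"):
        print(record)
```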
  16. Workflow systems & notebooks
      ● Analysis as:
        ○ An executable recipe (see the sketch below)
        ○ A document or commentary
      ● Many candidates:
        ○ Workflows:
          ■ Snakemake
          ■ Nextflow
          ■ CWL etc.
        ○ Computational notebooks:
          ■ Jupyter / IPython
          ■ RMarkdown
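A plain-Python sketch of the "analysis as an executable recipe" idea, with placeholder steps and filenames; a Snakemake or Nextflow pipeline expresses the same input/output dependencies declaratively and adds re-running only what changed, parallelism and provenance:

```python
from pathlib import Path

def align(raw, aligned):
    """Placeholder step: pretend to produce alignments from raw reads."""
    Path(aligned).write_text(f"aligned({Path(raw).name})")

def count(aligned, counts):
    """Placeholder step: pretend to produce a count table from alignments."""
    Path(counts).write_text(f"counts({Path(aligned).name})")

# The recipe: each step names its input and output, and only runs
# if its output is missing (a crude version of what workflow engines do).
STEPS = [
    (align, "reads.fastq", "reads.bam"),
    (count, "reads.bam", "counts.tsv"),
]

Path("reads.fastq").touch()   # stand-in raw input
for step, inp, out in STEPS:
    if not Path(out).exists():
        step(inp, out)
print(Path("counts.tsv").read_text())
```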
  17. Deep learning / machine learning
      How do you know a biologist is using deep learning in their research? Don’t worry, they’ll tell you.
      ● “Just” optimization and search techniques
      ● Takes a set of features and produces a model that performs a classification or a regression
      ● A series of layers that assemble features into higher-level features (see the sketch below)
      ● Several high-quality toolkits
      ● Some need for specialised hardware (GPU)
      ● Interpretability
      ● Ground truths
      ● Needs lots of data
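A minimal sketch of "layers assembling features into higher-level features", assuming the Keras toolkit is available; the data are random stand-ins with a planted signal, and the layer sizes are arbitrary:

```python
import numpy as np
from tensorflow import keras

# Random stand-in data: 500 samples, 40 features, binary outcome driven by two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# Stacked dense layers: each layer builds higher-level features from the previous one.
model = keras.Sequential([
    keras.layers.Input(shape=(40,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy] on the training data
```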
  18. The pitfalls
  19. Batch effects
      ● Technical sources of variation
        ○ Reagents
        ○ Technician
        ○ Platform
        ○ ...
      ● Solutions:
        ○ Plot data (see the sketch below)
        ○ Don’t batch
        ○ ComBat etc. (but loss of information)
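A sketch of the "plot data" advice: a PCA coloured by batch usually makes a strong batch effect obvious before any correction is attempted; the expression matrix here is simulated with a deliberate batch shift:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Simulated matrix: 100 samples x 50 features, two batches of 50 samples,
# with an artificial offset added to batch 2.
X = rng.normal(size=(100, 50))
batch = np.repeat([1, 2], 50)
X[batch == 2] += 1.5          # the batch effect

coords = PCA(n_components=2).fit_transform(X)
for b, colour in [(1, "tab:blue"), (2, "tab:orange")]:
    mask = batch == b
    plt.scatter(coords[mask, 0], coords[mask, 1], c=colour, label=f"batch {b}")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.title("Samples separating by batch, not biology")
plt.show()
```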
  20. Omnigenics
      What if every gene affected every other gene?
      ● Boyle, Li & Pritchard 2017
      ● FOAF / six degrees of separation effect
      ● Implicated genes are a few drivers and an enormous number of “related” loci
      ● Context?
  21. The garden of forking paths
      Multiple hypothesis testing (see the sketch below)
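A minimal sketch of guarding against false discovery across many tests, using Benjamini-Hochberg FDR control from statsmodels; the data are simulated, with a real effect planted in only the first 50 of 1,000 features:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
# 1000 features tested case vs control; only the first 50 carry a real effect.
n_features, n_per_group = 1000, 30
cases = rng.normal(size=(n_per_group, n_features))
controls = rng.normal(size=(n_per_group, n_features))
cases[:, :50] += 1.0

pvals = stats.ttest_ind(cases, controls, axis=0).pvalue

# Naive thresholding versus Benjamini-Hochberg FDR control.
print("p < 0.05, uncorrected:", int((pvals < 0.05).sum()))
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("significant after BH FDR:", int(reject.sum()))
```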
  22. Conclusion: taming the 4-headed beast
      Acquiring: interpret EHR
      Storing: data formats & systems
      Analysing: statistics, correct for batch effects, integrative analysis, deep learning
      Sharing: standards, data formats, workflow systems
