ReComp for genomics


Our vision for the selective re-computation of genomics pipelines in reaction to changes to tools and reference datasets.
How do you prioritise patients for re-analysis on a given budget?

Published in: Technology


  1. ReComp for genomics
     Our Vision: selective re-computation of genomics pipelines in reaction to changes
     Nov, 2016
     Dr. Paolo Missier, School of Computing Science, Newcastle University
  2. Data Analytics enabled by NGS
     • Genomics: WES / WGS, variant calling, variant interpretation, diagnosis
       - Eg 100K Genome Project, Genomics England, GeCIP
     • Metagenomics: species identification
       - Eg The EBI metagenomics portal
     [Figure text: submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through web interface]
     [Figure: pipeline in three stages (Stage 1, Stage 2, Stage 3). Steps shown: raw sequences, align, clean, recalibrate alignments, calculate coverage, call variants, recalibrate variants, filter variants, annotate. Outputs shown: coverage information, annotated variants]
  3. Understanding change: threats and opportunities
     [Figure: Big Data, Life Sciences Analytics, “Valuable Knowledge” (versions V1, V2, V3 over time t); meta-knowledge: algorithms, tools, middleware, reference datasets]
     Many of the elements involved in producing analytical knowledge change over time:
     • Algorithms and tools
     • Accuracy of input sequences
     • Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard,…)
     Key questions for the ReComp project:
     • Threats: Will any of the changes invalidate prior findings?
     • Opportunities: Can the findings from the pipelines be improved over time?
     • Cost: Need to model future costs based on past history and pricing trends for virtual appliances
     • Impact:
       - Which patients/samples are likely to be affected?
       - How do we estimate the potential benefits on affected patients?
       - Re-computations are expensive. Can we estimate the impact of these changes without re-computing entire cohorts?
  4. The ReComp vision
     ReComp: a decision support system for selectively re-computing complex analytics in reaction to change
       - Generic: not just for the life sciences!
       - Customisable: eg for genomics pipelines
     • Observe change: in big data, in meta-knowledge
     • Assess and measure: knowledge decay
     • Estimate: cost and benefits of refresh
     • Enact: reproduce (analytics) processes
     [Figure: Big Data, Life Sciences Analytics, “Valuable Knowledge” (versions V1, V2, V3 over time t); meta-knowledge: algorithms, tools, middleware, reference datasets]
  5. Approach and challenges
     Approach: It’s all in the meta-data!
     1. History of past computations. Capture details of analytics tasks and their executions:
       - Structure and dependencies of the process
       - Cost
       - Provenance of the outcomes
     2. Metadata analytics: Learn from history
       - Estimation models for impact, cost, benefits
     Challenges:
     1. Learning from history and optimisation:
       • What types of meta-knowledge need to be captured, and how much history is required to make optimal re-computation decisions?
       • Can we use history to learn estimates of impact without the need for actual re-computation?
     2. Software infrastructure and tooling: ReComp aims to deliver a metadata management and analytics stack
     3. Reproducibility: How do we ensure that the “ReComp” button will actually perform a valid re-computation?
     4. Impact: Which areas of genomics, and more broadly bioinformatics, can benefit from ReComp?
  6. Project structure
     • 3 years funding from the EPSRC (£585,000 grant) on the Making Sense from Data call
     • Feb. 2016 - Jan. 2019
     • 2 RAs fully employed in Newcastle
     • PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
     • Co-Investigators (8% each):
       - Prof. Watson, School of Computing Science, Newcastle University
       - Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
       - Dr. Phil James, Civil Engineering, Newcastle University
     Builds upon the experience of the Cloud-e-Genome project: 2013-2015
     Aims:
       - To demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud
       - To facilitate the adoption of reliable genetic testing in clinical practice
     • A collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University
     • Funding: NIHR / Newcastle BRC (£180,000) plus $40,000 Microsoft Research grant “Azure for Research”
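
The slides ask how patients could be prioritised for re-analysis on a given budget, and propose learning impact and cost estimates from history. As a rough illustration only, the Python sketch below assumes such estimates already exist for each patient and greedily picks the candidates with the best estimated impact per unit cost until the budget runs out. The `Candidate` type, its fields, the `prioritise` function and the sample figures are all hypothetical; the slides do not specify a selection algorithm, and a greedy ranking is just one simple heuristic for this kind of budgeted choice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """One patient/sample whose earlier result may be affected by a change.

    All fields are hypothetical placeholders for estimates that would be
    learned from the history of past computations.
    """
    patient_id: str
    est_impact: float  # estimated benefit of re-analysis (arbitrary units)
    est_cost: float    # estimated cost of re-running the pipeline


def prioritise(candidates: List[Candidate],
               budget: float) -> Tuple[List[Candidate], float]:
    """Greedily select candidates by estimated impact per unit cost.

    Returns the selected candidates (best ratio first) and the total
    estimated cost. This is a heuristic, not an optimal knapsack solution.
    """
    # Ignore candidates with no expected benefit or an unusable cost estimate.
    useful = [c for c in candidates if c.est_impact > 0 and c.est_cost > 0]
    ranked = sorted(useful,
                    key=lambda c: c.est_impact / c.est_cost,
                    reverse=True)

    selected: List[Candidate] = []
    spent = 0.0
    for c in ranked:
        # Take a candidate only if it still fits in the remaining budget.
        if spent + c.est_cost <= budget:
            selected.append(c)
            spent += c.est_cost
    return selected, spent


if __name__ == "__main__":
    # Made-up example cohort; the numbers carry no real meaning.
    cohort = [
        Candidate("P1", est_impact=0.9, est_cost=40.0),
        Candidate("P2", est_impact=0.2, est_cost=10.0),
        Candidate("P3", est_impact=0.6, est_cost=20.0),
        Candidate("P4", est_impact=0.0, est_cost=15.0),
    ]
    chosen, spent = prioritise(cohort, budget=60.0)
    print([c.patient_id for c in chosen], spent)  # ['P3', 'P1'] 60.0
```

In this sketch P4 is skipped because its estimated impact is zero, and P2 is skipped because it no longer fits once P3 and P1 have used the budget. How good such a ranking is depends entirely on the quality of the impact and cost estimates, which is the open question the slides raise.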