Our vision for the selective re-computation of genomics pipelines in reaction to changes to tools and reference datasets.
How do you prioritise patients for re-analysis on a given budget?
1. ReComp for genomics
Our vision: selective re-computation of genomics pipelines in reaction to changes
Nov, 2016
Dr. Paolo Missier
School of Computing Science
Newcastle University
2. Data Analytics enabled by NGS
Genomics: WES / WGS, Variant calling, Variant interpretation, diagnosis
- E.g. 100,000 Genomes Project, Genomics England, GeCIP
[Figure: NGS variant-calling pipeline, run per sample and organised in Stages 1 to 3. Steps shown: raw sequences, align, clean, recalibrate alignments, calculate coverage (coverage information), call variants, recalibrate variants, filter variants, annotate (annotated variants).]
Metagenomics: Species identification
- E.g. the EBI metagenomics portal
[Figure: EBI metagenomics portal. Submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through web interface.]
3. Understanding change: threats and opportunities
[Figure: Big Data and Meta-knowledge (algorithms, tools, middleware, reference datasets) feed Life Sciences Analytics, which produces “Valuable Knowledge” in successive versions V1, V2, V3 as each element changes over time t.]
Key questions for the ReComp project:
• Threats: Will any of the changes invalidate prior findings?
• Opportunities: Can the findings from the pipelines be improved over time?
• Cost: Need to model future costs based on past history and pricing trends for virtual appliances
• Impact:
• Which patients/samples are likely to be affected?
• How do we estimate the potential benefits on affected patients?
• Re-computations are expensive. Can we estimate the impact of these changes without re-computing entire cohorts?
Many of the elements involved in producing analytical knowledge change over time:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCards,…)
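The question "which patients/samples are likely to be affected?" can be approached, in the simplest case, by diffing two versions of a reference database and intersecting the changed entries with each patient's variants. The sketch below is only an illustration under strong assumptions: the variant IDs, the classification labels, and the function names `changed_variants` and `affected_patients` are hypothetical, and real reference databases such as ClinVar have a far richer structure.

```python
from typing import Dict, Set

# Hypothetical model: a reference database version maps a variant ID
# to a classification label. Real databases are much richer than this.
ReferenceDB = Dict[str, str]


def changed_variants(old: ReferenceDB, new: ReferenceDB) -> Set[str]:
    """Return variant IDs that were added, removed, or reclassified."""
    all_ids = old.keys() | new.keys()
    return {v for v in all_ids if old.get(v) != new.get(v)}


def affected_patients(
    patient_variants: Dict[str, Set[str]], changed: Set[str]
) -> Dict[str, Set[str]]:
    """Map each affected patient to the changed variants they carry."""
    hits = {p: vs & changed for p, vs in patient_variants.items()}
    # Patients with no overlap are out of scope for this change.
    return {p: h for p, h in hits.items() if h}


# Toy data (assumed, for illustration only).
old_db = {"var1": "benign", "var2": "uncertain", "var3": "pathogenic"}
new_db = {
    "var1": "benign",
    "var2": "pathogenic",   # reclassified
    "var3": "pathogenic",
    "var4": "pathogenic",   # newly added
}
patients = {"P1": {"var1", "var3"}, "P2": {"var2"}, "P3": {"var1", "var4"}}

changed = changed_variants(old_db, new_db)
scope = affected_patients(patients, changed)
print(scope)
```

Only the patients in `scope` would be candidates for re-analysis; the rest of the cohort is untouched by this particular change.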
4. The ReComp vision
Observe change
• In big data
• In meta-knowledge
Assess and measure
• Knowledge decay
Estimate
• Cost and benefits of refresh
Enact
• Reproduce (analytics) processes
ReComp: a decision support system for selectively re-computing complex analytics in reaction to change
- Generic: not just for the life sciences!
- Customisable: e.g. for genomics pipelines
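To make the "estimate, then enact" steps concrete, here is a minimal sketch of prioritising patients for re-analysis on a given budget. It assumes that an estimated benefit and an estimated cost are already available for each patient; the `Candidate` type, the `prioritise` function, and all numbers are hypothetical. The selection is a greedy benefit-per-cost heuristic, which is not guaranteed to be optimal (the underlying problem is a knapsack problem), and it is not presented as the method ReComp uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    patient_id: str
    est_benefit: float  # assumed output of an impact estimation model
    est_cost: float     # assumed output of a cost model, same unit as budget


def prioritise(candidates: List[Candidate], budget: float) -> List[Candidate]:
    """Greedily pick candidates by benefit/cost ratio until the budget is spent."""
    for c in candidates:
        if c.est_cost <= 0:
            raise ValueError(f"non-positive cost for {c.patient_id}")
    ranked = sorted(
        candidates, key=lambda c: c.est_benefit / c.est_cost, reverse=True
    )
    chosen: List[Candidate] = []
    spent = 0.0
    for c in ranked:
        # Skip anything that no longer fits; cheaper items further down may fit.
        if spent + c.est_cost <= budget:
            chosen.append(c)
            spent += c.est_cost
    return chosen


# Toy estimates (assumed, for illustration only).
cohort = [
    Candidate("P1", est_benefit=0.9, est_cost=10.0),
    Candidate("P2", est_benefit=0.5, est_cost=2.0),
    Candidate("P3", est_benefit=0.4, est_cost=8.0),
    Candidate("P4", est_benefit=0.3, est_cost=3.0),
]
selected = prioritise(cohort, budget=13.0)
print([c.patient_id for c in selected])
```

With these toy numbers the heuristic skips P1, whose cost no longer fits once P2 and P4 are chosen, and fills the remaining budget with P3.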
5. Approach and challenges
Challenges:
1. Learning from history and optimisation:
• What types of meta-knowledge need to be captured, and how much history is required to make optimal re-computation decisions?
• Can we use history to learn estimates of impact without the need for actual re-computation?
2. Software infrastructure and tooling
ReComp aims to deliver a metadata management and analytics stack
3. Reproducibility:
How do we ensure that the “ReComp” button will actually perform a valid re-computation?
4. Impact:
Which areas of genomics and more broadly bioinformatics can benefit from ReComp?
Approach: It’s all in the meta-data!
1. History of past computations. Capture details of analytics tasks and their executions:
- Structure and dependencies of the process
- Cost
- Provenance of the outcomes
2. Metadata analytics: Learn from history
- Estimation models for impact, cost, benefits
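The two points above can be sketched as a small history store plus two queries over it. This is only an illustration under assumptions: the `ExecutionRecord` fields, the step names, the dependency map, and the cost unit are all hypothetical, and averaging past costs is the simplest possible stand-in for an estimation model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Set


@dataclass
class ExecutionRecord:
    """One past execution of one pipeline step (hypothetical schema)."""
    sample_id: str
    step: str
    tool_version: str
    inputs: List[str]    # provenance: what the step consumed
    outputs: List[str]   # provenance: what the step produced
    cost: float          # e.g. CPU-hours; the unit is an assumption


def estimate_cost(history: List[ExecutionRecord], step: str) -> Optional[float]:
    """Estimate the cost of a step as the mean of its past costs, if any."""
    costs = [r.cost for r in history if r.step == step]
    return mean(costs) if costs else None


def steps_to_rerun(depends_on: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return the changed step plus every step that transitively depends on it."""
    rerun = {changed}
    grew = True
    while grew:
        grew = False
        for step, parents in depends_on.items():
            if step not in rerun and rerun.intersection(parents):
                rerun.add(step)
                grew = True
    return rerun


# Toy process structure: each step lists the steps it depends on (assumed).
deps = {
    "align": [],
    "calculate_coverage": ["align"],
    "call_variants": ["align"],
    "filter_variants": ["call_variants"],
    "annotate": ["filter_variants"],
}
history = [
    ExecutionRecord("S1", "annotate", "v1", ["S1.filtered"], ["S1.annotated"], 1.0),
    ExecutionRecord("S2", "annotate", "v1", ["S2.filtered"], ["S2.annotated"], 3.0),
    ExecutionRecord("S1", "align", "v1", ["S1.raw"], ["S1.aligned"], 20.0),
]

print(steps_to_rerun(deps, "call_variants"))
print(estimate_cost(history, "annotate"))
```

A change to the variant caller leaves alignment and coverage untouched, so only the downstream steps, with their estimated costs, need to be considered for re-computation.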
6. Project structure
• 3 years' funding from the EPSRC (£585,000 grant) under the Making Sense from Data call
• Feb. 2016 - Jan. 2019
• 2 RAs fully employed in Newcastle
• PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
• Co-Investigators (8% each):
• Prof. Watson, School of Computing Science, Newcastle University
• Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
• Dr. Phil James, Civil Engineering, Newcastle University
Builds upon the experience of the Cloud-e-Genome project: 2013-2015
Aims:
- To demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud
- To facilitate the adoption of reliable genetic testing in clinical practice
- A collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University
- Funding: NIHR / Newcastle BRC (£180,000) plus a $40,000 Microsoft Research “Azure for Research” grant