Selective and incremental re-computation in reaction to changes: an exercise in metadata analytics
ReComp – Durham University, May 31st, 2018
Paolo Missier
Slide 52: Differential execution
Suppose D is a relation (a table). diffD() can ...

Slide 53: Partial re-computation using input difference
Idea: run SVI but replac...

Slide 54: Approach – a combination of techniques
1. Partial re-execution
2. Diff...

Slide 55: 3: precisely identify the scope of a change
Patient / DB version: impac...

Slide 56: A weak scoping algorithm
Coarse-grained provenance; candidate invocatio...

Slide 57: Scoping: precision
• The approach avoids the majority of re-computatio...

Slide 58: Summary of ReComp challenges
Change Events; History DB; Reproducibility: ...

Slide 59: Come to our workshop during Provenance Week!
https://sites.google.com/...

Slide 60: Questions?
http://recomp.org.uk/
Meta-*

Slide 61: The Metadata Analytics challenge:
Learning from a metadata DB of execu...

Slide 62: History Database
HDB: A metadata-database containing records of past e...

Slide 63: Impact (again)
Given P (fixed), a change in one of the inputs to P: C=...

Slide 64: ReComp decisions
Given a population X of prior inputs: given a change ...

Slide 65: Two possible approaches
1. Direct estimator of impact function: here t...

Slide 66: History DB and Differences DB
Whenever P is re-computed on input X, a ...

Slide 67: ReComp algorithm
Diagram: ReComp takes change C and inputs X, with evidence E: HDB, DDB; it produces decisions and updated evidence E’: HDB’, DDB’.

Slide 68: Learning challenges
• Evidence is small and sparse
• How can it be use...
Invited research seminar given at Durham University, Computer Science, about findings from the ReComp project: http://recomp.org.uk/
Slide 1: Selective and incremental re-computation in reaction to changes: an exercise in metadata analytics
recomp.org.uk
Paolo Missier, Jacek Cala, Jannetta Steyn
School of Computing, Newcastle University, UK
Durham University, May 31st, 2018
Meta-*
In collaboration with:
• Institute of Genetic Medicine, Newcastle University
• School of GeoSciences, Newcastle University

Slide 2: Data Science
Diagram: Big Data → The Big Analytics Machine → “Valuable Knowledge”
Meta-knowledge: algorithms, tools, middleware, reference datasets

Slide 3: Data Science over time
Diagram: Big Data → The Big Analytics Machine → “Valuable Knowledge” (versions V1, V2, V3 over time t)
Meta-knowledge: algorithms, tools, middleware, reference datasets (each evolving over time t)
Life Science Analytics

Slide 4: Understanding change
The same diagram, with both the data and the meta-knowledge changing over time.
• Threats: Will any of the changes invalidate prior findings?
• Opportunities: Can the findings be improved over time?
ReComp space = expensive analysis + frequent changes + high impact
Analytics within the ReComp space:
C1: are resource-intensive and thus expensive when repeatedly executed over time, i.e., on a cloud or HPC cluster;
C2: require sophisticated implementations to run efficiently, such as workflows with a nested structure;
C3: depend on multiple reference datasets and software libraries and tools, some of which are versioned and evolve over time;
C4: apply to a possibly large population of input instances;
C5: deliver valuable knowledge.

Slide 5: Talk Outline
ReComp: selective re-computation to refresh outcomes in reaction to change
• Case study 1: Re-computation decisions for flood simulations
  - Learning useful estimators for the impact of change
  - Black-box computation, coarse-grained changes
• Case study 2: High-throughput genomics data processing
  - An exercise in provenance collection and analytics
  - White-box computation, fine-grained changes
• Open challenges

Slide 6: Case study 1: Flood modelling simulation
City Catchment Analysis Tool (CityCAT), Vassilis Glenis et al., School of Engineering, NU
Simulation characteristics:
• Part of Newcastle upon Tyne DTM: ≈2.3M cells, 2x2 m cell size
• Building and green areas from Nov 2017
• Rainfall event with a return period of 50 years
• Simulation time: 60 mins
• 10–25 frames with water depth and velocity in each cell
• Output size: 23x65 MiB ≈ 1.5 GiB (water depth heat map)
Slide 7: When should we repeat an expensive simulation?
• Scenario: extreme rainfall event simulation (in Newcastle), run through the CityCat flood simulator, producing a flood diffusion time series
• New buildings / green areas may alter water flow: can we predict high-difference areas without re-running the simulation?
• Running CityCat is generally expensive: processing for the Newcastle area takes ≈3h on a 4-core i7 3.2GHz CPU (and it is a placeholder for more expensive simulations!)
• Map updates are infrequent (6 months), but useful when simulating changes, e.g. for planning purposes

Slide 8: Estimating the impact of a flood simulation
Suppose we are able to quantify:
- the difference between inputs M and M’
- the difference between outputs F and F’
Suppose also that we are only interested in large enough changes between two outputs:
  diffO(F, F’) > θO    (1)
for some user-defined threshold θO.
Problem statement: can we define an ideal ReComp decision function which
- operates on two versions of the inputs, M and M’, and the old output F, and
- returns true iff (1) would return true when F’ is actually computed?
In other words: can we predict when F’ needs to be computed?

Slide 9: Approach
1. Define input diff and output diff functions: diffI(M, M’) and diffO(F, F’)
2. Define an impact function over the input changes and the old output
3. Define the ReComp decision function: re-compute iff the estimated impact exceeds θImp, where θImp is a tunable parameter
ReComp approximates (1), so it is subject to errors:
- false positives: ReComp triggers a re-computation, but (1) does not hold;
- false negatives: ReComp does not trigger a re-computation, but (1) holds.
4. Use ground data to determine values for θImp as a function of FPR and FNR
Note: the ReComp function should be much less expensive to compute than sim()
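The three-step decision logic above can be sketched in a few lines. This is a minimal illustration with hypothetical names (recomp_decision, diff_i, impact, theta_imp), not the project's actual code:

```python
# Sketch of the ReComp decision function (hypothetical names).
# The whole point: recomp_decision() must be far cheaper than sim() itself.

def recomp_decision(m_old, m_new, f_old, diff_i, impact, theta_imp):
    """Return True if the estimated impact of the input change
    warrants re-running the simulation."""
    change = diff_i(m_old, m_new)   # step 1: input diff
    score = impact(change, f_old)   # step 2: impact estimate on the old output
    return score > theta_imp        # step 3: tunable threshold test

# Toy usage: inputs are dicts of cell -> elevation; the toy impact estimate
# is the largest absolute input change.
diff_i = lambda m0, m1: {k: m1[k] - m0.get(k, 0.0) for k in m1}
impact = lambda ch, f: max(abs(v) for v in ch.values())

m_old, m_new = {"c1": 1.0, "c2": 2.0}, {"c1": 1.0, "c2": 2.6}
f_old = {"c1": 0.1, "c2": 0.4}
print(recomp_decision(m_old, m_new, f_old, diff_i, impact, theta_imp=0.5))  # True
```

Note that the function never calls sim(): it sees only the input diff and the old output, which is what makes the approach cheaper than blind re-computation.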
Slide 10: Diff and impact functions
B: buildings, L: other land, H: hard surface
f() partitions polygon changes into 6 types (e.g. B–, L+, B–∩L+, …). For each type, compute the average water depth within and around the footprint of the change; the impact function returns the max of the average water depths over all changes.
The output diff is the max of the differences between spatially averaged F and F’ over a window W.
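A toy sketch of the impact computation described on this slide, assuming a hypothetical grid encoding (cell → water depth) and precomputed change windows; the real function operates on map polygons over the DTM:

```python
# Sketch: for each map change, average the old water depth over the cells in
# and around the change footprint, then take the max over all changes.

def avg_depth(depth, cells):
    """Mean water depth over a set of cells (footprint + surrounding window)."""
    vals = [depth[c] for c in cells if c in depth]
    return sum(vals) / len(vals) if vals else 0.0

def impact(changes, depth):
    """changes: list of (change_type, cells_in_window); depth: cell -> depth.
    Returns the max of the spatially averaged depths over all changes."""
    return max((avg_depth(depth, cells) for _, cells in changes), default=0.0)

depth = {(0, 0): 0.30, (0, 1): 0.50, (1, 0): 0.10}
changes = [("B-", [(0, 0), (0, 1)]),   # a building removed
           ("L+", [(1, 0)])]           # a green area added
print(impact(changes, depth))  # approximately 0.4
```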
Slide 11: Tuning the threshold parameter
Ground data from all past re-computations: FP: <1,0>, FN: <0,1>
Set FNR to be close to 0; experimentally find the θImp that minimises FPR (max specificity).
Figure: precision, recall, accuracy, and specificity plotted against θImp (0.10–0.25), for window size 20x20m and θO = 0.2m, over all changes and over consecutive changes.
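The tuning procedure (pin the false-negative rate near 0, then minimise the false-positive rate by taking the largest admissible threshold) can be illustrated as follows, with made-up ground data; names and data are hypothetical:

```python
# Sketch of threshold tuning from ground data (hypothetical data and names).
# Each record: (impact score, actually_changed), where actually_changed is
# the ground truth diffO(F, F') > thetaO from a past re-computation.

def tune_threshold(ground, fnr_target=0.0):
    """Pick the largest theta whose false-negative rate is <= fnr_target.
    A larger theta means fewer re-computations, hence a lower FPR
    (i.e. higher specificity)."""
    best = 0.0
    for theta in sorted({score for score, _ in ground}):
        fn = sum(1 for s, changed in ground if changed and s <= theta)
        pos = sum(1 for _, changed in ground if changed)
        fnr = fn / pos if pos else 0.0
        if fnr <= fnr_target:
            best = theta
    return best

ground = [(0.05, False), (0.10, False), (0.15, False),
          (0.20, True), (0.30, True), (0.40, True)]
print(tune_threshold(ground))  # 0.15
```

With this data, 0.15 is the largest threshold that still triggers a re-computation for every ground-truth change.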
Slide 12: Experimental results

Slide 13: Summary of the approach
Diagram: historical data <M, M’, F, F’> provides the ground data used to tune the decision function against a target FPR; the tuned function maps (M, M’, F) to True (compute F’) or False.

Slide 14: Talk Outline
ReComp: selective re-computation to refresh outcomes in reaction to change
• Case study 1: Re-computation decisions for flood simulations
  - Learning useful estimators for the impact of change
• Case study 2: High-throughput genomics data processing
  - An exercise in provenance collection and analytics
• Open challenges

Slide 15: Data Analytics enabled by Next Gen Sequencing
Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
- e.g. 100K Genome Project, Genomics England, GeCIP
• Submission of sequence data for archiving and analysis
• Data analysis using selected EBI and external software tools
• Data presentation and visualisation through a web interface
Pipeline (Stages 1–3): raw sequences → align → clean → recalibrate alignments → calculate coverage → call variants → recalibrate variants → filter variants → annotate → annotated variants
Metagenomics: species identification
- e.g. the EBI metagenomics portal

Slide 16: Whole-exome variant calling pipeline
Tool choices per stage:
• Alignment: BWA, Bowtie, Novoalign
• Picard: MarkDuplicates
• GATK quality score recalibration
• Variant calling: GATK-Haplotype Caller, FreeBayes, SamTools
• Variant recalibration
• Annovar functional annotations (e.g. MAF, synonymy, SNPs…), followed by in-house annotations
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43

Slide 17: Expensive
Data stats per sample:
• 4 files per sample (2-lane, paired-end reads)
• ≈15 GB of compressed text data (gz), ≈40 GB uncompressed text data (FASTQ)
Usually 30–40 input samples: 0.45–0.6 TB of compressed data, 1.2–1.6 TB uncompressed
Most steps use 8–10 GB of reference data
A small 6-sample run takes about 30h on the IGM HPC machine (Stages 1+2)
Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Future Generation Computer Systems, Special Issue: Big Data in the Cloud, 2016

Slide 19: SVI: Simple Variant Interpretation
Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
- e.g. 100K Genome Project, Genomics England, GeCIP
SVI filters, then classifies variants into three categories: pathogenic, benign, and unknown/uncertain.
Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer

Slide 20: Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar and OMIM evolve rapidly
- New reference data sources
Figure: evolution in the number of variants that affect patients (a) with a specific phenotype, (b) across all phenotypes

Slide 21: Baseline: blind re-computation
Sparsity issue:
• About 500 executions over 33 patients, total runtime about 60 hours
• Only 14 relevant output changes detected: ≈4.2 hours of computation per change, ≈7 minutes / patient (single-core VM)
Should we care about database updates?
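The per-change and per-execution costs quoted on this slide follow directly from the totals; a quick check using the slide's rounded figures (60 h / 14 changes is ≈4.3 h, which the slide states as ≈4.2 h from the underlying unrounded totals):

```python
# Quick check of the baseline numbers from this slide.
executions = 500            # blind re-computations
total_hours = 60.0          # total runtime (approximate)
relevant_changes = 14       # relevant output changes actually detected

hours_per_change = total_hours / relevant_changes
minutes_per_execution = total_hours * 60 / executions

print(round(hours_per_change, 1))      # 4.3 hours of computation per change
print(round(minutes_per_execution, 1)) # 7.2 minutes per execution
```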
Slide 22: Unstable
Any of the pipeline stages may change over time, semi-independently:
• Alignment: BWA, Bowtie, Novoalign; Picard: MarkDuplicates; GATK quality score recalibration
• Variant calling: GATK-Haplotype Caller, FreeBayes, SamTools; variant recalibration
• Annovar functional annotations (e.g. MAF, synonymy, SNPs…), followed by in-house annotations
• dbSNP builds: 147 (4/16), 148 (6/16), 149 (11/16), 150 (2/17)
• Human reference genome: h19 → h37, h38, …
(Pipeline from Van der Auwera et al. (2013), as on slide 16)

Slide 23: FreeBayes vs SamTools vs GATK-Haplotype Caller
GATK: McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–303. https://doi.org/10.1101/gr.107524.110
FreeBayes: Garrison, E., and Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 (2012).
GIAB: Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotech, 32(3), 246–251. http://dx.doi.org/10.1038/nbt.2835
Cornish, A., and Guda, C. A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference. BioMed Research International, vol. 2015, Article ID 456479, 2015. doi:10.1155/2015/456479
Hwang, S., Kim, E., Lee, I., & Marcotte, E. M. (2015). Systematic comparison of variant calling pipelines using gold standard personal exome variants. Scientific Reports, 5, 17875. https://doi.org/10.1038/srep17875

Slide 24: Comparing three versions of FreeBayes
Should we care about changes in the pipeline?
• Tested three versions of the caller: 0.9.10 (Dec 2013), 1.0.2 (Dec 2015), 1.1 (Nov 2016)
• The Venn diagram shows a quantitative comparison (% and number) of filtered variants
• Phred quality score > 30
• 16 patient BAM files (7 AD, 9 FTD-ALS)

Slide 25: Impact on SVI classification
Patient phenotypes: 7 Alzheimer’s (AD), 9 FTD-ALS
The ONLY change in the pipeline is the version of FreeBayes used to call variants.
(R)ed: confirmed pathogenicity; (A)mber: uncertain pathogenicity

Patient   0.9.10  1.0.2  1.1   Phenotype
B_0190    A       A      A     ALS-FTD
B_0191    A       A      A     ALS-FTD
B_0192    R       R      R     ALS-FTD
B_0193    A       A      A     ALS-FTD
B_0195    R       R      R     ALS-FTD
B_0196    R       R      R     ALS-FTD
B_0198    R       A      A     AD
B_0199    R       A      A     ALS-FTD
B_0201    R       R      R     AD
B_0202    A       A      A     AD
B_0203    R       R      R     AD
B_0208    R       A      A     AD
B_0209    R       R      R     AD
B_0211    R       A      A     ALS-FTD
B_0213    A       A      A     ALS-FTD
B_0214    R       R      R     AD

Slide 26: Changes: frequency / impact / cost
Chart: change frequency (high → low) against change impact on a cohort (low → high), for GATK, variant annotations (Annovar), the reference human genome, variant DBs (e.g. ClinVar), phenotype → disease mappings (e.g. OMIM GeneMap), and new sequences (the variant-calling “N+1 problem”, variant interpretation).

Slide 27: Changes: frequency / impact / cost
The same chart, with the high-frequency / high-impact region marked as the ReComp space.

Slide 28: When is ReComp effective?

Slide 29: The ReComp meta-process
Loop: observe process P → detect and measure changes (change events, data diff(.,.) functions) → estimate the impact of changes → select and enact → record execution history (History DB).
Approach:
1. Quantify data-diff and the impact of changes on prior outcomes
2. Collect and exploit process history metadata:
   - capture the history of past computations (process structure and dependencies, cost, provenance of the outcomes)
   - metadata analytics: learn estimation models for impact, cost, and benefits from history
Changes:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap…)
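The meta-process loop above can be sketched as follows; all interfaces here (estimate_impact, re_execute, the record tuples) are hypothetical placeholders for the components named on the slide:

```python
# Sketch of the ReComp meta-process loop (hypothetical interfaces).

def recomp_meta_process(change_events, history_db, estimate_impact, re_execute):
    """For each observed change: estimate its impact on each prior outcome
    recorded in the history DB, select the affected ones, re-enact them,
    and record the new executions back into the history."""
    refreshed = []
    for change in change_events:
        for record in list(history_db):          # iterate a snapshot
            if estimate_impact(change, record) > 0:      # select
                new_record = re_execute(record, change)  # enact
                history_db.append(new_record)            # record history
                refreshed.append(new_record)
    return refreshed

# Toy usage: records are (input, output) pairs; a change only touches
# inputs in its scope, and re-execution doubles the output.
history = [("x1", 10), ("x2", 20)]
impact = lambda ch, rec: 1 if rec[0] in ch["scope"] else 0
rerun = lambda rec, ch: (rec[0], rec[1] * 2)
out = recomp_meta_process([{"scope": {"x2"}}], history, impact, rerun)
print(out)  # [('x2', 40)]
```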
  29. 29. 32 ReComp–DurhamUniversityMay31st,2018 PaoloMissier changes, data diff, impact 1) Observed change events: (inputs, dependencies, or both) 3) Impact occurs to various degree on multiple prior outcomes. Impact of change C on the processing of a specific X: 2) Type-specific Diff functions: Impact is process- and data-specific:
  30. 30. 33 ReComp–DurhamUniversityMay31st,2018 PaoloMissier Impact Given P (fixed), a change in one of the inputs to P: C={xx’} affects a single output: However a change in one of the dependencies: C= {dd’} affects all outputs yt where version d of D was used
  31. 31. 34 ReComp–DurhamUniversityMay31st,2018 PaoloMissier Impact: importance and Scope Scope: which cases are affected? - Individual variants have an associated phenotype. - Patient cases also have a phenotype “a change in variant v can only have impact on a case X if V and X share the same phenotype” Importance: “Any variant with status moving from/to Red causes High impact on any X that is affected by the variant”
  32. 32. 35 ReComp–DurhamUniversityMay31st,2018 PaoloMissier Approach – a combination of techniques 1. Partial re-execution • Identify and re-enact the portion of a process that are affected by change 2. Differential execution • Input to the new execution consists of the differences between two versions of a changed dataset • Only feasible if some algebraic properties of the process hold 3. Identifying the scope of change – Loss-less • Exclude instances of the population that are certainly not affected
  33. 33. 37 ReComp–DurhamUniversityMay31st,2018 PaoloMissier Approach – a combination of techniques 1. Partial re-execution 2. Differential execution 3. Identifying the scope of change – Loss-less
  34. 34. 38 ReComp–DurhamUniversityMay31st,2018 PaoloMissier Role of Workflow Provenance in partial re-run User Execution «Association » «Usage» «Generation » « «C Controller Program Workflow Channel Port wasPartOf «wasDerivedFrom » hasSubProgram «hadPlan » controlledBy controls[*] [*] [*] [*] [*] [*] [0..1] [0..1] [*][1] [*] [*] [0..1] [0..1] hasOutPort [*][0..1] [1] «wasAssociatedWith » «agent » [1] [0..1] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [0..1] [0..1] hasInPort [*][0..1] connectsTo [*] [0..1] «wasInformedBy » [*][1] «wasGeneratedBy » «qualifiedGeneration » «qualifiedUsage » «qualifiedAssociation » hadEntity «used » hadOutPorthadInPort [*][1] [1] [1] [1] hadEntity hasDefaultPara
  35. 35. 39 ReComp–DurhamUniversityMay31st,2018 PaoloMissier History DB: Workflow Provenance Each invocation of an eSC workflow generates a provenance trace “plan” “plan execution” WF B1 B2 B1exec B2exec Data WFexec partOf partOf usagegeneration association association association db usage ProgramWorkflow Execution Entity (ref data)
  36. 36. 40 ReComp–DurhamUniversityMay31st,2018 PaoloMissier SVI as eScience Central workflow Phenotype to genes Variant selection Variant classification Patient variants GeneMap ClinVar Classified variants Phenotype
37. 1. Partial re-execution
Provenance pattern (ex. db = "ClinVar v.x"): the "plan" WF (blocks B1, B2) and its "plan execution" WFexec (B1exec, B2exec), with db used by B1exec and Data flowing from B1exec to B2exec.
1. Change detection: a provenance fact indicates that a new version Dnew of database db is available:
   wasDerivedFrom("db", Dnew)
2. Reacting to the change:
2.1 Find the entry point(s) into the workflow where db was used:
   :- execution(WFexec), wasPartOf(Xexec, WFexec), used(Xexec, "db")
2.2 Discover the rest of the affected sub-workflow graph, recursively:
   :- execution(WFexec), execution(B1exec), execution(B2exec),
      wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec),
      wasGeneratedBy(Data, B1exec), used(B2exec, Data)
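The recursive discovery in step 2.2 can be sketched as a forward traversal over provenance triples. The triple representation and helper names below are illustrative, not the actual query machinery used in ReComp:

```python
# Sketch: starting from executions that used the changed database,
# follow generation/usage links forward to collect the affected executions.

def affected_executions(prov, changed_entity):
    """prov: iterable of (subject, relation, object) provenance triples."""
    used = [(s, o) for (s, r, o) in prov if r == "used"]
    gen = [(s, o) for (s, r, o) in prov if r == "wasGeneratedBy"]
    # Step 2.1: entry points -- executions that used the changed entity
    affected = {s for (s, o) in used if o == changed_entity}
    frontier = set(affected)
    while frontier:  # step 2.2: recursive downstream discovery
        data = {d for (d, e) in gen if e in frontier}        # data they produced
        frontier = {s for (s, o) in used if o in data} - affected
        affected |= frontier                                  # consumers of that data
    return affected

prov = [
    ("B1exec", "used", "db"),
    ("Data", "wasGeneratedBy", "B1exec"),
    ("B2exec", "used", "Data"),
    ("B3exec", "used", "other"),
]
print(sorted(affected_executions(prov, "db")))  # ['B1exec', 'B2exec']
```

B3exec never touches the changed database or anything derived from it, so it stays out of the re-execution scope.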
38. Minimal sub-graphs in SVI
[Figure: the minimal sub-workflows re-executed after a change in ClinVar and after a change in GeneMap.]
Overhead: intermediate data required for partial re-execution must be cached: 156 MB for GeneMap changes, 37 kB for ClinVar changes.
Time savings:
  Change   | Partial re-execution (sec) | Complete re-execution (sec) | Time saving (%)
  GeneMap  | 325                        | 455                         | 28.5
  ClinVar  | 287                        | 455                         | 37
39. Approach: a combination of techniques (recap)
1. Partial re-execution
2. Differential execution
3. Identifying the scope of change (loss-less)
40. Diff functions: example
[Figure: corresponding records of ClinVar 1/2016 and ClinVar 1/2017, with the diff highlighting the changed records and the unchanged portion.]
41. Compute difference sets: ClinVar
The ClinVar dataset has 30 columns.
Changes between versions: records 349,074 → 543,841; added 200,746; removed 5,979; updated 27,662.
42. For tabular data, difference is just select-project
Key columns: {"#AlleleID", "Assembly", "Chromosome"}
"Where" columns: {"ClinicalSignificance"}
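For a table keyed on the columns above, the diff reduces to key lookups: records whose key appears only in the new version are added, those only in the old are removed, and those whose "where" columns changed are updated. A minimal pure-Python sketch; the column names come from the slide, but the row representation is an assumption:

```python
def table_diff(old_rows, new_rows, key_cols, where_cols):
    """Select-project diff of two tables given key and comparison columns."""
    key = lambda r: tuple(r[c] for c in key_cols)
    proj = lambda r: tuple(r[c] for c in where_cols)
    old_idx = {key(r): r for r in old_rows}
    new_idx = {key(r): r for r in new_rows}
    added = [r for k, r in new_idx.items() if k not in old_idx]
    removed = [r for k, r in old_idx.items() if k not in new_idx]
    updated = [r for k, r in new_idx.items()
               if k in old_idx and proj(r) != proj(old_idx[k])]
    return added, removed, updated

old = [{"#AlleleID": 1, "Assembly": "GRCh37", "Chromosome": "1",
        "ClinicalSignificance": "Benign"}]
new = [{"#AlleleID": 1, "Assembly": "GRCh37", "Chromosome": "1",
        "ClinicalSignificance": "Pathogenic"},
       {"#AlleleID": 2, "Assembly": "GRCh37", "Chromosome": "2",
        "ClinicalSignificance": "Benign"}]
added, removed, updated = table_diff(
    old, new, ["#AlleleID", "Assembly", "Chromosome"], ["ClinicalSignificance"])
print(len(added), len(removed), len(updated))  # 1 0 1
```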
43. Differential execution
[Figure: the same ClinVar 1/2016 vs 1/2017 diff, now used as the input to the re-execution in place of the full new version.]
44. Differential execution
Suppose D is a relation (a table). diff_D(D1, D2) can be expressed as the pair of difference sets
  D⁺ = D2 − D1 (added records),  D⁻ = D1 − D2 (removed records),
so that D2 = (D1 ∪ D⁺) − D⁻.
We compute P(D2) as the combination of the already-available P(D1) with P(D⁺) and P(D⁻):
  P(D2) = (P(D1) ∪ P(D⁺)) − P(D⁻)
This is effective if the difference sets are much smaller than D2, and can be achieved provided P is distributive wrt set union and difference.
Cf. F. McSherry, D. Murray, R. Isaacs, and M. Isard, "Differential dataflow," in Proceedings of CIDR 2013, 2013.
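The differential scheme can be illustrated with a process P that distributes over set union and difference. Here P is a toy selection predicate, purely for illustration:

```python
# Differential execution sketch: for a P distributive over union and
# difference, P on the new version equals the combination of P on the
# old version with P on the added and removed records.

def P(records):
    """A selection is distributive wrt set union and difference."""
    return {r for r in records if r % 2 == 0}  # toy predicate: keep evens

D1 = {1, 2, 3, 4}                 # old version
D2 = {2, 4, 5, 6}                 # new version
added, removed = D2 - D1, D1 - D2  # the difference sets

# Combine the cached old result with P applied only to the (small) diffs.
incremental = (P(D1) | P(added)) - P(removed)
assert incremental == P(D2)
print(sorted(incremental))  # [2, 4, 6]
```

The gain is that P runs on the difference sets, which (as the next slide's measurements show for GeneMap and ClinVar) are typically far smaller than the full new version.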
45. Partial re-computation using input difference
Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2)).
This works for SVI, but is hard to generalise: it depends on the type of process.
The bigger gain: diff(CV1, CV2) is much smaller than CV2.

GeneMap versions (from → to) | To-version record count | Difference record count | Reduction
16-03-08 → 16-06-07          | 15,910                  | 1,458                   | 91%
16-03-08 → 16-04-28          | 15,871                  | 1,386                   | 91%
16-04-28 → 16-06-01          | 15,897                  | 78                      | 99.5%
16-06-01 → 16-06-02          | 15,897                  | 2                       | 99.99%
16-06-02 → 16-06-07          | 15,910                  | 33                      | 99.8%

ClinVar versions (from → to) | To-version record count | Difference record count | Reduction
15-02 → 16-05                | 290,815                 | 38,216                  | 87%
15-02 → 16-02                | 285,042                 | 35,550                  | 88%
16-02 → 16-05                | 290,815                 | 3,322                   | 98.9%
46. Approach: a combination of techniques (recap)
1. Partial re-execution
2. Differential execution
3. Identifying the scope of change (loss-less)
47. 3. Precisely identify the scope of a change
Patient / DB-version impact matrix.
- Strong scope: fine-grained provenance.
- Weak scope: "if CVi was used in the processing of pj, then pj is in scope" (coarse-grained provenance; next slide).
- Semantic scope: domain-specific scoping rules.
48. A weak scoping algorithm (coarse-grained provenance)
Candidate invocation: any invocation I of P whose provenance contains statements of the form:
  used(A, "db"), wasPartOf(A, I), wasAssociatedWith(I, _, WF)
Sketch of the algorithm:
- For each candidate invocation I of P:
  - partially re-execute using the difference sets as inputs  (see previous slides)
  - find the minimal subgraph P' of P that needs re-computation  (see above)
  - repeat: execute P' one step at a time, until <empty output> or <P' completed>
  - if <P' completed> and not <empty output>, execute P' on the full inputs
[Figure: the "plan" / "plan execution" provenance pattern, as before.]
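The sketch above can be phrased as a small driver loop. All the helpers here are hypothetical stand-ins for the workflow machinery, not ReComp's actual interfaces:

```python
def weak_scope_recompute(invocations, diff_sets, run_partial, run_full):
    """For each candidate invocation, try the cheap differential run first;
    fall back to a full re-execution only if the partial run produced output."""
    recomputed = []
    for inv in invocations:
        partial_out = run_partial(inv, diff_sets)   # minimal subgraph on diffs
        if partial_out:                             # change reaches this invocation
            recomputed.append((inv, run_full(inv)))
        # empty output: invocation provably unaffected, skip the full run
    return recomputed

# Toy stand-ins: invocation 'p1' is touched by the changed records, 'p2' is not.
diffs = {"v1"}
run_partial = lambda inv, d: d & {"v1"} if inv == "p1" else set()
run_full = lambda inv: f"refreshed({inv})"
print(weak_scope_recompute(["p1", "p2"], diffs, run_partial, run_full))
```

Only `p1` incurs the cost of a full re-execution; `p2` is eliminated by the cheap differential probe.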
49. Scoping: precision
- The approach avoids the majority of re-computations for a given ClinVar change.
- The number of complete re-executions is reduced from 495 down to 71.
50. Summary of ReComp challenges
- Reproducibility requires virtualisation, and not all runtime environments support provenance recording.
- Sensitivity analysis is unlikely to work well: small input perturbations can have a potentially large impact on a diagnosis.
- Learning useful impact estimators is hard.
- Diff functions are both type- and application-specific; they need to be generalised from specific to generic.
[Figure: the ReComp loop connecting change events, the History DB, data diff(.,.) functions, and observed executions of process P.]
51. Come to our workshop during Provenance Week!
https://sites.google.com/view/incremental-recomp-workshop
July 12th (pm) and 13th (am), King's College London
http://provenanceweek2018.org/
52. Questions?
http://recomp.org.uk/
53. The metadata analytics challenge: learning from a metadata DB of execution history to support automated ReComp decisions
54. History Database
HDB: a metadata database containing records of past executions.
Example: consider only one type of change, the variant caller: GATK (HaplotypeCaller) → FreeBayes 0.9 (C1) → FreeBayes 1.0 (C2) → FreeBayes 1.1 (C3).
[Figure: inputs X1..X5 and the outputs Y11..Y53 produced for them under successive caller versions, as recorded in HDB.]
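One plausible in-memory shape for such execution records, with a lookup helper, might look like the following. The field names are assumptions made for illustration, not the project's actual schema:

```python
# Hypothetical shape of HDB execution records: which caller version
# produced which output from which input.
hdb = [
    {"input": "X1", "output": "Y11", "caller": "GATK HaplotypeCaller"},
    {"input": "X1", "output": "Y12", "caller": "FreeBayes 0.9"},
    {"input": "X5", "output": "Y51", "caller": "GATK HaplotypeCaller"},
]

def executions_for(hdb, input_id):
    """All recorded executions for a given input."""
    return [er for er in hdb if er["input"] == input_id]

print([er["output"] for er in executions_for(hdb, "X1")])  # ['Y11', 'Y12']
```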
55. Impact (again)
Given a fixed process P:
- a change C = {x → x'} in one of the inputs to P affects a single output;
- a change C = {d → d'} in one of the dependencies affects all outputs yt for which version d of D was used.
56. ReComp decisions
Given a population X of prior inputs and a change C, ReComp makes a yes/no decision for each x ∈ X: True if P is to be executed again on x, and False otherwise.
To decide, ReComp must estimate the impact of the change on each output (as well as estimate the re-computation cost).
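A decision of this kind can be sketched as a threshold test over estimated impact and cost. The estimator and the budget rule below are placeholders, not the project's actual policy:

```python
def recomp_decision(est_impact, est_cost, impact_threshold, budget):
    """True iff the estimated impact justifies the estimated re-run cost."""
    return est_impact >= impact_threshold and est_cost <= budget

def select_for_recomp(population, impact_of, cost_of, thr, budget):
    """Yes/no decision for each prior input x in the population."""
    return [x for x in population
            if recomp_decision(impact_of(x), cost_of(x), thr, budget)]

pop = ["X1", "X2", "X3"]
impact = {"X1": 0.9, "X2": 0.1, "X3": 0.7}.get  # stub impact estimates
cost = {"X1": 5, "X2": 5, "X3": 50}.get         # stub cost estimates
print(select_for_recomp(pop, impact, cost, thr=0.5, budget=10))  # ['X1']
```

X2 is skipped for low estimated impact, X3 for excessive estimated cost.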
57. Two possible approaches
1. Learn a direct estimator of the impact function. Here the problem is to learn such a function for specific P, C, and data types Y.
2. Learn an emulator (surrogate) for P, a function f̂ which is simpler to compute and provides a useful approximation:
   ŷ = f̂(x) + ε
   where ε is a stochastic term that accounts for the error in approximating f. Learning requires a training set {(xi, yi)}. If such an f̂ can be found, we can hope to use it to approximate the impact of a change without re-running P.
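The emulator idea can be illustrated with a least-squares surrogate fitted to past (x, y) pairs and used to cheaply predict the effect of a changed input. This is a toy 1-D linear fit; a real emulator for a process like SVI would be far richer:

```python
def fit_linear_surrogate(xs, ys):
    """Closed-form least squares for y ~ a*x + b: a toy emulator f_hat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return lambda x: a * (x - mx) + my

# Training set {(x_i, y_i)} drawn from past executions of P
xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]   # here the true P is P(x) = 2x + 1
f_hat = fit_linear_surrogate(xs, ys)

# Estimate the impact of a change x -> x' without re-running P
x_old, x_new = 2, 2.5
est_impact = abs(f_hat(x_new) - f_hat(x_old))
print(round(est_impact, 2))  # 1.0
```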
58. History DB and Differences DB
Whenever P is re-computed on input X, a new execution record er' is added to HDB for X.
Using diff() functions we produce a derived difference record dr; these are collected in a Differences database DDB:
  dr1 = imp(C1, Y11), dr2 = imp(C1, Y41), dr3 = imp(C1, Y51), dr4 = imp(C2, Y52)
[Figure: the HDB of inputs X1..X5 and outputs Y11..Y53 across caller versions, with the difference records derived from re-executions.]
59. ReComp algorithm
[Figure: the ReComp loop: given a change C, the population X, and evidence E = HDB ∪ DDB, ReComp emits per-input decisions and updates the evidence to E' = HDB' ∪ DDB'.]
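One iteration of this loop can be sketched end to end. Every component below (the decision rule, the executor, the diff) is a stub standing in for the machinery of the previous slides:

```python
def recomp_step(change, population, hdb, ddb, decide, execute, diff):
    """One ReComp iteration: decide per input, re-run where warranted,
    and extend the evidence (HDB and DDB) with the new records."""
    for x in population:
        if not decide(change, x, hdb, ddb):
            continue
        y_new = execute(x)  # re-run P on x
        y_old = next(er["output"] for er in reversed(hdb)
                     if er["input"] == x)  # most recent prior output for x
        hdb.append({"input": x, "output": y_new, "change": change})
        ddb.append({"change": change, "impact": diff(y_old, y_new)})
    return hdb, ddb

hdb = [{"input": "X1", "output": 10, "change": None}]
ddb = []
recomp_step("C1", ["X1"], hdb, ddb,
            decide=lambda c, x, h, d: True,   # stub: always re-compute
            execute=lambda x: 12,             # stub re-execution of P
            diff=lambda a, b: abs(b - a))     # stub output diff
print(len(hdb), ddb[0]["impact"])  # 2 2
```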
60. Learning challenges
- Evidence is small and sparse: how can it be used for selecting from X?
- Learning a reliable imp() function is not feasible. What is the use of history? You never see the same change twice!
- We must somehow use evidence from related changes.
- A possible approach:
  - ReComp makes probabilistic decisions and takes chances;
  - associate a reward with each ReComp decision → reinforcement learning;
  - Bayesian inference: use new evidence to update probabilities.
[Figure: the HDB and DDB of the previous slides.]

Invited research seminar given at Durham University, Computer Science, about findings from the ReComp project: http://recomp.org.uk/
