-
1.
ReComp – BITS meeting, Italy, June 2017 – P. Missier
Preserving the currency of genomics outcomes over time through
selective re-computation: models and initial findings
recomp.org.uk
Paolo Missier, Jacek Cala, Jannetta Steyn
School of Computing
Newcastle University, UK
14th Annual Meeting of the Bioinformatics Italian Society
Cagliari, Italy
July, 2017
Panta Rhei (Heraclitus)
(*) Painting by Johannes Moreelse
-
2.
Data Science
[Diagram: Big Data feeds "The Big Analytics Machine", producing "Valuable Knowledge"; the machine relies on meta-knowledge: algorithms, tools, middleware, reference datasets]
-
3.
Data Science over time
[Diagram: the same picture with time added. Big Data feeds the Big Analytics Machine (here, Life Science Analytics), producing "Valuable Knowledge" in successive versions V1, V2, V3; the meta-knowledge (algorithms, tools, middleware, reference datasets) also evolves over time t]
-
4.
Talk Outline
• The importance of quantifying changes to meta-knowledge, and
their impact
• Cloud e-Genome: WES data processing on the cloud using
workflow technology
• ReComp: selective re-analysis of cases in reaction to change
• Techniques and initial findings
• Open challenges
-
5.
Data Analytics enabled by NGS
Genomics: WES / WGS → variant calling → variant interpretation → diagnosis
- Eg 100K Genome Project, Genomics England, GeCIP
[Diagram: submission of sequence data for archiving and analysis → data analysis using selected EBI and external software tools → data presentation and visualisation through a web interface]
[Pipeline diagram: Stage 1: raw sequences → align → clean → recalibrate alignments → calculate coverage (coverage information); Stage 2: call variants → recalibrate variants; Stage 3: filter variants → annotate → annotated variants]
Metagenomics: Species identification
- Eg The EBI metagenomics portal
-
6.
SVI: Simple Variant Interpretation
Genomics: WES / WGS → variant calling → variant interpretation → diagnosis
- Eg 100K Genome Project, Genomics England, GeCIP
[Pipeline diagram as before: Stage 1: raw sequences → align → clean → recalibrate alignments → calculate coverage; Stage 2: call variants → recalibrate variants; Stage 3: filter variants → annotate → annotated variants]
SVI filters, then classifies, variants into three categories: pathogenic, benign and unknown/uncertain
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya,
E.; Kirby, R.; and Keogh, M. In Procs. 11th International conference on Data Integration in the Life Sciences,
Los Angeles, CA, 2015. Springer
-
7.
Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
Evolution in the number of variants that affect patients:
(a) with a specific phenotype
(b) across all phenotypes
-
8.
Baseline: blind re-computation
Sparsity issue:
• About 500 executions over 33 patients
• Total runtime: about 60 hours
• Only 14 relevant output changes detected: ≈4.2 hours of computation per change
• Yet a single SVI run takes only ≈7 minutes per patient (single-core VM)
-
9.
1. Whole-exome variant calling
[Pipeline diagram: alignment (BWA, Bowtie, Novoalign) → Picard MarkDuplicates → GATK quality score recalibration → variant calling (GATK HaplotypeCaller, FreeBayes, SamTools) → variant recalibration → Annovar functional annotations (eg MAF, synonymy, SNPs…), followed by in-house annotations]
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43
dbSNP builds: 150 (2/2017), 149 (11/2016), 148 (6/2016), 147 (4/2016)
-
10.
FreeBayes vs SamTools vs GATK-HC
GATK: McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M.
A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA
sequencing data. Genome Research, 20(9), 1297–303. https://doi.org/10.1101/gr.107524.110
FreeBayes: Garrison, Erik, and Gabor Marth. "Haplotype-based variant detection from short-read
sequencing." arXiv preprint arXiv:1207.3907 (2012).
GIAB: Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014).
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype
calls. Nat Biotech, 32(3), 246–251. http://dx.doi.org/10.1038/nbt.2835
Adam Cornish and Chittibabu Guda, “A Comparison of Variant Calling Pipelines Using Genome in a
Bottle as a Reference,” BioMed Research International, vol. 2015, Article ID 456479, 11 pages, 2015.
doi:10.1155/2015/456479
Hwang, S., Kim, E., Lee, I., & Marcotte, E. M. (2015). Systematic comparison of variant calling
pipelines using gold standard personal exome variants. Scientific Reports, 5(December), 17875.
https://doi.org/10.1038/srep17875
-
11.
Our study: comparing three versions of Freebayes
• Tested three versions of the caller:
• 0.9.10 Dec 2013
• 1.0.2 Dec 2015
• 1.1 Nov 2016
• The Venn diagram shows a quantitative comparison (% and number) of filtered variants
• Filter: Phred quality score > 30
• Input: 16 patient BAM files (7 AD, 9 FTD-ALS)
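As a reminder of what the threshold means: Phred scores are log-scaled error probabilities (Q = -10·log10 p), so QUAL > 30 keeps calls with an error probability below 1 in 1,000. A minimal sketch (function names are illustrative, not from the study's code):

```python
def phred_to_error_prob(q):
    # Phred: Q = -10 * log10(p)  =>  p = 10 ** (-Q / 10)
    return 10 ** (-q / 10)

def filter_by_qual(variants, min_qual=30):
    """Keep variant calls whose Phred-scaled QUAL exceeds min_qual.
    QUAL > 30 corresponds to an error probability below 0.1%."""
    return [v for v in variants if v["QUAL"] > min_qual]
```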
-
12.
Impact on SVI classification
Patient phenotypes: 7 Alzheimer’s, 9 FTD-ALS
The ONLY change in the pipeline is the version of Freebayes used to call variants
The table shows the final SVI classification
(R)ed – confirmed pathogenicity
(A)mber – uncertain pathogenicity
Patient ID  Phenotype  v0.9.10  v1.0.2  v1.1
B_0190      ALS-FTD    A        A       A
B_0191      ALS-FTD    A        A       A
B_0192      ALS-FTD    R        R       R
B_0193      ALS-FTD    A        A       A
B_0195      ALS-FTD    R        R       R
B_0196      ALS-FTD    R        R       R
B_0198      AD         R        A       A
B_0199      ALS-FTD    R        A       A
B_0201      AD         R        R       R
B_0202      AD         A        A       A
B_0203      AD         R        R       R
B_0208      AD         R        A       A
B_0209      AD         R        R       R
B_0211      ALS-FTD    R        A       A
B_0213      ALS-FTD    A        A       A
B_0214      AD         R        R       R
In four cases a change in the caller version changes the classification
-
13.
Changes: frequency vs impact
[Chart: change impact on a cohort (low → high) vs change frequency (low → high). Lower-frequency, higher-impact: GATK, variant caller, reference human genome. Higher-frequency: variant annotations (Annovar), variant DBs (eg ClinVar), phenotype-disease mappings (eg OMIM GeneMap). New sequences: the N+1 problem. Regions: variant calling vs variant interpretation]
-
14.
Changes: frequency vs impact
[Same chart as the previous slide, with the high-frequency, high-impact region highlighted as the "ReComp space"]
-
15.
Understanding change
[Diagram: Big Data → Life Sciences Analytics → "Valuable Knowledge" (V1, V2, V3), with the meta-knowledge (algorithms, tools, middleware, reference datasets) all changing over time t]
• Threats: Will any of the changes invalidate prior findings? (diagnoses)
• Opportunities: Can the findings (diagnoses) be improved over time?
• Impact: Which patients/samples are going to be affected? To what extent?
Many of the elements involved in producing
analytical knowledge change over time:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar,
OMIM GeneMap, GeneCard,…)
ReComp space = expensive analysis + frequent changes + high impact
-
16.
Talk Outline
• The importance of quantifying changes to meta-knowledge, and
their impact
• Cloud e-Genome: WES data processing on the cloud using
workflow technology
• ReComp: selective re-analysis of cases in reaction to change
• Techniques and initial findings
• Open challenges
-
17.
WES pipeline: scale
Data stats per sample:
• 4 files per sample (2-lane, pair-end reads)
• ≈15 GB of compressed text data (gz); ≈40 GB uncompressed (FASTQ)
Usually 30-40 input samples:
• 0.45-0.6 TB of compressed data; 1.2-1.6 TB uncompressed
Most steps use 8-10 GB of reference data
A small 6-sample run takes about 30 h on the IGM HPC machine (Stages 1+2)
-
18.
Workflow Design
echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP

echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED

echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
  METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true

echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
  RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
  RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"

echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS

"Wrapper" blocks vs. utility blocks
-
19.
Workflow design
[Pipeline diagram as before: Stage 1 alignment/coverage, Stage 2 variant calling/recalibration, Stage 3 filtering/annotation]
Conceptual: the 3-stage pipeline above
Actual: 11 workflows, 101 blocks, 28 tool blocks
-
20.
Parallelism in the pipeline
[Diagram: two levels of parallelism. A sample split gives per-sample parallel processing in Stage I (align, clean, recalibrate, coverage; one branch per sample 1..N). A chromosome split (Chr1, Chr2, …, ChrM) gives per-chromosome parallel processing of variant calling/recalibration (Stage II) and variant filtering/annotation (Stage III), producing annotated variants per sample]
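The two-level scatter pattern (per-sample, then per-chromosome) can be sketched in Python. This is a toy illustration, not the e-Science Central implementation: the stage functions are placeholders standing in for the real alignment, calling and annotation tools.

```python
from concurrent.futures import ThreadPoolExecutor

# 22 autosomes plus X, Y and the mitochondrial "chromosome"
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]

def align_clean_recalibrate(sample):
    # Stage I stand-in: per-sample alignment, cleaning, recalibration.
    return f"{sample}.bam"

def call_variants(bam, chromosome):
    # Stage II stand-in: per-chromosome variant calling and recalibration.
    return f"{bam}.{chromosome}.vcf"

def filter_annotate(vcf):
    # Stage III stand-in: per-chromosome filtering and annotation.
    return f"{vcf}.annotated"

def run_pipeline(samples, max_workers=4):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Per-sample parallelism in Stage I (the "sample split").
        bams = dict(zip(samples, pool.map(align_clean_recalibrate, samples)))
        for sample, bam in bams.items():
            # Per-chromosome parallelism in Stages II-III (the "chromosome split").
            vcfs = pool.map(call_variants, [bam] * len(CHROMOSOMES), CHROMOSOMES)
            results[sample] = [filter_annotate(v) for v in vcfs]
    return results
```

In the real pipeline each branch is a separate workflow invocation dispatched to a cluster of workflow engines rather than a local thread pool.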
-
21.
Workflow on Azure Cloud – modular configuration
[Architecture diagram: an Azure VM hosts the e-Science Central main server (web UI, REST API, JMS queue), backed by an e-SC database and an Azure Blob store for workflow data; web browsers and a rich client app talk to the server; workflow invocations are dispatched to multiple worker-role workflow engines, which share the e-SC blob store. Module configuration: 3 nodes, 24 cores]
-
22.
Performance
Configurations for the 3-VM experiments:
• HPC cluster (dedicated nodes): 3x 8-core compute nodes, Intel Xeon E5640 2.67 GHz, 48 GiB RAM, 160 GB scratch space
• Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD, Ubuntu 14.04
[Chart: response time (hh:mm, 00:00-72:00) vs number of samples (0-24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)]
-
23.
Comparison with HPC
[Chart: response time (hours, 0-168) vs number of input samples (0-24), for HPC (3 compute nodes), Azure (3x D13, SSD) sync, and Azure (3x D13, SSD) chained]
Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Cala, J.;
Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Future Generation Computer Systems, Special Issue:
Big Data in the Cloud, 2016
-
24.
Cost
A 6 engine configuration achieves near-optimal cost/sample
[Chart: cost per sample [£] and cost per GiB [£] vs number of samples (0-24) / input data size (0-350 GiB), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)]
-
25.
What about flexibility? GATK-HC vs FreeBayes
[Diagram: the conceptual three-stage pipeline refactored into executable e-SC workflows for the two callers. The GATK-HC variant uses per-lane alignment, per-sample cleaning and recalibration, the haplotype caller with a chromosome split, and variant recalibration; the FreeBayes variant swaps in the FreeBayes caller and omits recalibration. Shared <<Exec>> blocks: align, clean, calculate coverage, recalibrate alignments, filter variants, annotate]
-
26.
From Cloud-eGenome to ReComp
• Variant calling / interpretation pipelines are still computationally expensive
• Workflow technology is suitable and scalable, with the additional advantage of automated recording of data provenance
• But flexibility is largely an illusion
-
27.
Talk Outline
• The importance of quantifying changes to meta-knowledge, and
their impact
• Cloud e-Genome: WES data processing on the cloud using
workflow technology
• ReComp: selective re-analysis of cases in reaction to change
• Techniques and initial findings
• Open challenges
-
28.
The ReComp meta-process
[Diagram: the ReComp loop around process P: observe executions → record execution history (History DB) → detect and measure changes (change events, data-specific diff(.,.) functions) → estimate impact of changes → enact re-computation on demand]
1. Capture the history of past computations:
- Process Structure and dependencies
- Cost
- Provenance of the outcomes
2. Metadata analytics: Learn from history
- Estimation models for impact, cost, benefits
Approach:
1. Metadata
2. Data-specific “diff” functions
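A minimal sketch of this meta-process in Python. All names here (Execution, HistoryDB, react_to_change) are illustrative, not the ReComp API; the point is only the shape of the loop: record history, diff a changed resource, estimate impact, re-enact selectively.

```python
from dataclasses import dataclass, field

@dataclass
class Execution:
    process: str
    inputs: dict      # resource name -> version used, e.g. {"ClinVar": "2016-01"}
    output: str
    cost: float       # runtime in seconds, kept for cost/benefit estimation

@dataclass
class HistoryDB:
    executions: list = field(default_factory=list)

    def record(self, ex):
        self.executions.append(ex)

    def candidates(self, resource):
        # All past executions whose recorded inputs mention the changed resource.
        return [e for e in self.executions if resource in e.inputs]

def react_to_change(history, resource, old, new, diff_fn, impact_fn, rerun):
    """One turn of the ReComp loop: diff the changed resource, then re-enact
    only those past executions the impact estimator flags as affected."""
    delta = diff_fn(old, new)
    rerun_count = 0
    for ex in history.candidates(resource):
        if impact_fn(ex, delta):
            rerun(ex, delta)
            rerun_count += 1
    return rerun_count
```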
-
29.
History DB: Workflow Provenance
Each invocation of an eSC workflow
generates a detailed provenance trace
http://vcvcomputing.com/provone/provone.html
[UML diagram of the ProvONE provenance model: workflow-structure classes (Program, Workflow, Channel, Port, Controller) linked to trace classes (Execution, Usage, Generation, Association, User) via PROV relations such as wasPartOf, used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith]
-
30.
Diff functions: example
[Diagram: ClinVar 1/2016 and ClinVar 1/2017 feed a diff function; unchanged records are discarded]
-
31.
Compute difference sets – ClinVar
The ClinVar dataset: 30 columns
Changes between versions: 349,074 → 543,841 records
Added: 200,746; removed: 5,979; updated: 27,662
-
32.
A generic tool to compute difference sets for tabular data
Key columns: {"#AlleleID", "Assembly", "Chromosome"}
"Where" columns: {"ClinicalSignificance"}
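The core of such a generic diff tool can be sketched in a few lines (rows as dicts; the function name is illustrative, not the actual ReComp tool). Key columns identify a record across versions; a record present in both versions is reported as changed only when one of the "where" columns differs, so edits to other columns (e.g. LastEvaluated) are deliberately ignored.

```python
def diff_tables(old_rows, new_rows, key_cols, where_cols):
    """Generic difference sets for tabular data: (added, removed, changed)."""
    key = lambda r: tuple(r[c] for c in key_cols)
    old_idx = {key(r): r for r in old_rows}
    new_idx = {key(r): r for r in new_rows}
    added = [new_idx[k] for k in new_idx.keys() - old_idx.keys()]
    removed = [old_idx[k] for k in old_idx.keys() - new_idx.keys()]
    # Shared keys: changed only if a "where" column differs.
    changed = [new_idx[k] for k in new_idx.keys() & old_idx.keys()
               if any(old_idx[k][c] != new_idx[k][c] for c in where_cols)]
    return added, removed, changed
```

With the ClinVar settings above, a record whose #AlleleID is unchanged but whose Chromosome differs lands in both the added and removed sets, since the key no longer matches.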
-
33.
Reducing the computation performed in reaction to changes
Constraint: selective re-computation should be lossless
• All instances that are subject to impact will be considered
1. Partial re-execution
• Identify and re-enact only the portions of a process that are affected by the change
2. Differential execution
• Input to the new execution consists of the differences between two versions of a
changed dataset
• Only feasible if some algebraic properties of the process hold
3. Identifying the scope of change
• Determine which instances out of a population of outcomes are going to be
affected by the change
-
34.
SVI as eScience Central workflow
[Workflow diagram: inputs Phenotype and Patient variants; phenotype-to-genes mapping (uses GeneMap) → variant selection → variant classification (uses ClinVar) → classified variants]
-
35.
1. Partial re-execution
1. Change detection: a provenance fact indicates that a new version Dnew of
database db is available: wasDerivedFrom("db", Dnew)
2. Reacting to the change:
2.1 Find the entry point(s) into the workflow, where db was used:
  :- execution(WFexec), wasPartOf(B1exec, WFexec), used(B1exec, "db")
2.2 Discover the rest of the sub-workflow graph (evaluated recursively):
  :- execution(WFexec), execution(B1exec), execution(B2exec),
     wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec),
     wasGeneratedBy(Data, B1exec), used(B2exec, Data)
Ex. db = "ClinVar v.x"
[Provenance pattern diagram: workflow WF (the "plan", with blocks B1, B2) and its execution WFexec (the "plan execution", with block executions B1exec, B2exec); B1exec uses db and generates Data, which B2exec uses; association edges link each execution to its plan]
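The two queries amount to a reachability computation over the provenance graph. A sketch in Python, assuming the provenance facts are available as in-memory collections (names and data shapes are illustrative, not the actual ReComp store):

```python
from collections import deque

def affected_block_executions(used, generated_by, changed_input):
    """used: iterable of (block_exec, data_id) pairs, one per 'used' fact.
    generated_by: dict data_id -> block_exec, one per 'wasGeneratedBy' fact.
    Returns every block execution downstream of the changed input."""
    consumers = {}
    for block, data in used:
        consumers.setdefault(data, []).append(block)
    outputs = {}
    for data, block in generated_by.items():
        outputs.setdefault(block, []).append(data)

    # Step 2.1: entry points are the direct consumers of the changed database.
    frontier = deque(consumers.get(changed_input, []))
    affected = set()
    # Step 2.2: follow generation/usage edges recursively.
    while frontier:
        block = frontier.popleft()
        if block in affected:
            continue
        affected.add(block)
        for data in outputs.get(block, []):        # data this block produced
            frontier.extend(consumers.get(data, []))  # its downstream consumers
    return affected
```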
-
36.
Minimal sub-graphs in SVI
[Diagram: the minimal SVI sub-workflows to re-execute after a change in ClinVar vs a change in GeneMap]
Partial execution following a change in only one of the databases requires caching the intermediate data at the boundary of the blue and red areas
-
37.
Generating the sub-workflow
-
38.
Workflows are stored in the History DB
[Diagram: workflows are serialised into a Neo4J graph DB, the History DB]
-
39.
Workflow copy’n’paste
[Diagram: ReComp partial rerun: given a set of starting blocks, a sub-workflow extractor queries the Neo4J graph DB]
-
40.
Workflow copy’n’paste
[Diagram: as on the previous slide; the extracted sub-workflow is de-serialised and executed]
-
41.
Results
• Overhead: storing the interim data required for partial re-execution
  • 156 MB for GeneMap changes and 37 kB for ClinVar changes
Time savings:
           Partial re-execution (sec)  Complete re-execution (sec)  Time saving (%)
GeneMap    325                         455                          28.5
ClinVar    287                         455                          37
-
42.
2. Differential execution
[Diagram: ClinVar 1/2016 and ClinVar 1/2017 feed a diff function; unchanged records are discarded]
-
43.
P2: Partial re-computation using input difference
Idea: run SVI but replace the ClinVar query with a query on the diff between ClinVar versions:
Q(CV) → Q(diff(CV1, CV2))
Big gain: diff(CV1, CV2) is much smaller than CV2
Works for SVI, but hard to generalise: it depends on the type of process
GeneMap versions (from → to)  To-version record count  Difference record count  Reduction
16-03-08 → 16-06-07           15,910                   1,458                    91%
16-03-08 → 16-04-28           15,871                   1,386                    91%
16-04-28 → 16-06-01           15,897                   78                       99.5%
16-06-01 → 16-06-02           15,897                   2                        99.99%
16-06-02 → 16-06-07           15,910                   33                       99.8%

ClinVar versions (from → to)  To-version record count  Difference record count  Reduction
15-02 → 16-05                 290,815                  38,216                   87%
15-02 → 16-02                 285,042                  35,550                   88%
16-02 → 16-05                 290,815                  3,322                    98.9%
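The idea can be illustrated with a deliberately simplified stand-in for SVI's classification step (not its actual code). It works because per-variant classification distributes over the union of ClinVar inputs, as the slide's caveat requires: variants untouched by the delta keep their cached class from the previous run.

```python
def differential_classify(patient_variants, cached_classes, cv_delta):
    """cached_classes: variant -> class from the previous run.
    cv_delta: added/changed/removed ClinVar records as variant -> new class,
    with None marking a removed record. Only variants mentioned in the delta
    are re-classified; everything else reuses the cached result."""
    out = dict(cached_classes)
    for v in patient_variants:
        if v in cv_delta:
            new_class = cv_delta[v]
            # A retracted record drops the variant back to 'unknown'.
            out[v] = "unknown" if new_class is None else new_class
    return out
```

Since the delta is often two orders of magnitude smaller than the full database (see the tables above), the lookup work shrinks accordingly.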
-
44.
3. Identifying the scope of change: a game of battleship
Patient / change impact matrix
Challenge:
precisely identify the scope of a change
Blind reaction to change: recompute the entire matrix
Can we do better?
- Hit the high-impact cases (the X marks) without re-computing the entire matrix
-
45.
A scoping algorithm
Candidate invocation: any invocation I of P whose provenance contains
statements of the form:
  used(A, "db"), wasPartOf(A, I), wasAssociatedWith(I, _, WF)
[Provenance pattern diagram: workflow WF with blocks B1, B2; execution WFexec with block executions B1exec, B2exec; B1exec uses db and generates Data, which B2exec uses]
Sketch of the algorithm:
- For each candidate invocation I of WF:
  - find the minimal subgraph P' of P that needs re-computation  # see (1)
  - partially re-execute P' using the difference sets as inputs  # see (2)
  - repeat: execute P' one step at a time
    until <empty output> or <P' completed>
  - if <P' completed> and not <empty output> then
    - execute P' on the full inputs
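In Python the sketch might look like this (names are illustrative; `steps` stands for the minimal sub-workflow found in step (1), each step a function from a set of delta records to the records surviving that stage):

```python
def scope_and_recompute(invocations, steps, delta, full_rerun):
    """Run each candidate invocation's sub-workflow on the difference set,
    one step at a time. An empty output at any step proves the change cannot
    reach this invocation's output, so it is pruned without a full re-run."""
    impacted = []
    for inv in invocations:
        data = delta
        for step in steps:
            data = step(inv, data)
            if not data:
                break              # pruned: no full re-execution needed
        else:
            full_rerun(inv)        # the delta propagated all the way through
            impacted.append(inv)
    return impacted
```

The `for/else` mirrors the sketch: the `else` branch (full re-execution) runs only when no step produced an empty output.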
-
46.
Scoping: precision
• The approach avoids the majority of re-computations given a ClinVar change
• Reduction in number of complete re-executions from 495 down to 71
-
47.
Conclusions: ReComp open problems
[Diagram: the ReComp loop (change events, History DB, data diff(.,.) functions, process P, observed executions) annotated with open problems along a specific ↔ generic axis]
• Diff functions are both type- and application-specific
• Not all runtime environments support provenance recording
• Reproducibility requires virtualisation
• Sensitivity analysis is unlikely to work well: small input perturbations can have a large impact on a diagnosis
• Learning useful impact estimators is hard
-
48.
Questions?
http://recomp.org.uk/
Genomics is a form of data-intensive / computation-intensive analysis
Changes in the reference databases have an impact on the classification
returns updates in mappings to genes that have changed between the two versions (including possibly new mappings):
$\diffOM(\OM^t, \OM^{t'}) = \{ \langle \dt, genes'(\dt) \rangle \mid genes(\dt) \neq genes'(\dt) \}$
where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$.
\begin{align*}
\diffCV(\CV^t, \CV^{t'}) = &\; \{ \langle v, \varst'(v) \rangle \mid \varst(v) \neq \varst'(v) \} \\
& \cup \CV^{t'} \setminus \CV^t \cup \CV^t \setminus \CV^{t'}
\end{align*}
where $\varst'(v)$ is the new class associated to $v$ in $\CV^{t'}$.
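The diffCV definition translates directly into code. A sketch, modelling a ClinVar version as a dict from variant to clinical status (the varst function above):

```python
def diff_cv(cv_old, cv_new):
    """Three components of diffCV: records whose status changed, records only
    in the new version, and records only in the old version."""
    changed = {v: cv_new[v] for v in cv_old.keys() & cv_new.keys()
               if cv_old[v] != cv_new[v]}
    added = {v: cv_new[v] for v in cv_new.keys() - cv_old.keys()}
    removed = {v: cv_old[v] for v in cv_old.keys() - cv_new.keys()}
    return changed, added, removed
```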
Point of slide: sparsity of impact demands better than blind recomp.
Table 1 summarises the results. We recorded four types of outcome: firstly, confirming the current diagnosis, which happens when additional variants are added to the Red class; secondly, retracting the diagnosis, which may happen (rarely) when all red variants are retracted; thirdly, changes in the amber class which do not alter the diagnosis; and finally, no change at all.
The table reports results from nearly 500 executions, concerning a cohort of 33 patients, for a total runtime of about 58.7 hours. As merely 14 relevant output changes were detected, this is about 4.2 hours of computation per change: a steep cost, considering that a single execution of SVI takes a little over 7 minutes.
Our recommendation is the use of the BWA-MEM and SamTools pipeline for SNP calls, and the BWA-MEM and GATK-HC pipeline for indel calls.
Changes can be frequent or rare, disruptive or marginal
Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store).
These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely, HG19 from UCSC).
A Modular architecture
Each sample included 2-lane, pair-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
3 workflow engines perform better than our HPC benchmark on larger sample sizes
Largely, all variant calling pipelines consist of a number of common steps such as sequence alignment, variant calling and variant annotation.
In detail, however, they may differ in how particular steps are connected together. For example:
alignment → variant calling
  The GATK best practices suggest that between these steps the pipeline should run base quality score recalibration (our "GATK phase 1" workflow, or the "recalibrate alignments" step). The FreeBayes pipeline does not include this step.
variant calling → annotation
  The GATK pipeline includes an additional variant recalibration step for SNPs and INDELs. The FreeBayes pipeline is simpler and does not include variant recalibration.
y^t = \mathit{exec}(P,x,D^t)
y^{t'}_+ = \mathit{exec}(P,x,\delta^+)
This is only a small selection of rows and a subset of columns. In total there were 30 columns, 349,074 rows in the old set and 543,841 rows in the new set; 200,746 rows were added, 5,979 removed and 27,662 changed.
As on the previous slide, you may want to highlight that the selection of key columns and where-columns is very important. For example, using #AlleleID, Assembly and Chromosome as the key columns, we have entry #AlleleID 15091 which looks very similar in both the added (green) and removed (red) sets. They differ, however, in the Chromosome column.
Considering the where-columns, using only ClinicalSignificance returns blue rows which differ between versions only in that column. Changes in other columns (e.g. LastEvaluated) are not reported, which may have ramifications if such a difference is used to produce the new output.
Firstly, if we can analyse the structure and semantics of process P, to recompute an instance of P more effectively we may be able to reduce re-computation to only those parts of the process that are actually involved in the processing of the changed data. For this, we are inspired by techniques for smart rerun of workflow-based applications [6, 7], as well as by more general approaches to incremental computation [8, 9].
Experimental setup for our study of ReComp techniques:
SVI workflow with automated provenance recording
Cohort of about 100 exomes (neurological disorders)
Changes in ClinVar and OMIM GeneMap
Also, as in Tab. 2 and 3 in the paper, I'd mention whether this reduction was possible with the generic diff function or with a specific function tailored to SVI.
What is also interesting, and what I would highlight, is that even if the reduction is very close to 100% (but below it), the cost of recomputing the process may still be significant because of constant-time overheads related to running a process (e.g. loading data into memory). e-SC workflows suffer from exactly this issue (every block serialises and deserialises data), and that's why Fig. 6 shows an increase in runtime for GeneMap executed with 2 deltas even though the reduction is 99.94% (cf. Tab. 2 and Fig. 6 for the GeneMap diff between 16-10-30 → 16-10-31).
Regarding the algorithm, you show the simplified version (Alg. 1). But please also take a look at Alg. 2 and mention that you can only run the loop if distributivity holds for all P in the downstream graph. Otherwise, you need to break and re-execute on the full inputs just after the first non-distributive task produces a non-empty output. But, obviously, the hope is that with a well-tailored diff function the output will be empty in the majority of cases.
This figure emphasises the penalty for running the algorithm when the difference sets were large compared to the actual new data. But it also highlights the importance of the diff and impact functions. Clearly, the more accurate the functions are, the higher the runtime savings may be, which stems from two facts. Firstly, a more accurate diff function tends to produce smaller difference sets, which reduces the time of task re-execution (cf. the CV-diff and CV-SVI-diff lines in Fig. 7). Secondly, a more accurate impact function tends to return false more frequently, and so the algorithm can more often avoid re-computation with the complete new version of the data (cf. the number of black squares vs the total number of patients affected by a change in Tab. 4).