
Preserving the currency of analytics outcomes over time through selective re-computation: techniques, initial findings, and open challenges

Invited talk at the University of Leeds, School of Computing (School Colloquia series), 24 November 2017

Slide 1.
Preserving the currency of analytics outcomes over time through selective re-computation: techniques, initial findings, and open challenges
recomp.org.uk
Paolo Missier, Jacek Cala, Jannetta Steyn
School of Computing, Newcastle University, UK
University of Leeds, School of Computing Colloquia series, November 2017
In collaboration with:
• Cambridge University (Prof. Chinnery, Department of Clinical Neurosciences)
• Institute of Genetic Medicine, Newcastle University
• School of GeoSciences, Newcastle University
Slide 2. Data Science
[Diagram: "Big Data" feeds "The Big Analytics Machine" (algorithms, tools, middleware, reference datasets: the meta-knowledge), which produces "Valuable Knowledge".]
Slide 3. Data Science over time
[Diagram: the same picture with time added. The valuable knowledge now comes in versions V1, V2, V3, and the meta-knowledge (algorithms, tools, middleware, reference datasets) evolves along its own timelines t. Running example: life-science analytics.]
Slide 4. Talk Outline
• The importance of quantifying changes to meta-knowledge, and their impact
• ReComp: selective re-computation to refresh outcomes in reaction to change
• Techniques and initial findings
• Open challenges
Slide 5. Data Analytics enabled by Next-Generation Sequencing
Genomics: WES/WGS, variant calling, variant interpretation → diagnosis
- e.g. the 100K Genome Project, Genomics England, GeCIP
Metagenomics: species identification
- e.g. the EBI Metagenomics portal
[Diagram: submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through a web interface. The variant pipeline runs in three stages: Stage 1, align, clean, recalibrate alignments and calculate coverage; Stage 2, call, recalibrate and filter variants; Stage 3, annotate, yielding coverage information and annotated variants.]
Slide 6. SVI: Simple Variant Interpretation
Genomics: WES/WGS, variant calling, variant interpretation → diagnosis
- e.g. the 100K Genome Project, Genomics England, GeCIP
[Diagram: the same three-stage pipeline as on slide 5, feeding SVI.]
SVI filters, then classifies variants into three categories: pathogenic, benign, and unknown/uncertain.
Reference: SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
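
The three-way decision described on this slide is simple enough to sketch in code. The snippet below is our own simplification, not SVI's implementation: the field names, the phenotype gene set and the ClinVar lookup are hypothetical stand-ins.

```python
# A minimal sketch of SVI-style three-way classification. All names here
# (fields, arguments) are hypothetical; see the DILS 2015 paper for the real tool.

def classify_variant(variant, phenotype_genes, clinvar_status):
    """Return 'pathogenic', 'benign' or 'unknown' for one annotated variant."""
    if variant["gene"] not in phenotype_genes:
        return None  # variant is out of scope for this phenotype
    status = clinvar_status.get(variant["id"])  # lookup in a ClinVar snapshot
    if status == "Pathogenic":
        return "pathogenic"
    if status == "Benign":
        return "benign"
    return "unknown"  # unannotated, or uncertain significance

# Toy example: one variant, a phenotype gene set, a two-entry ClinVar stand-in.
v = {"id": "rs123", "gene": "GRN"}
print(classify_variant(v, {"GRN", "MAPT"}, {"rs123": "Pathogenic"}))  # pathogenic
```

The point for what follows: the outcome depends on two evolving inputs, the phenotype gene set and the ClinVar snapshot, so a change in either can flip a classification.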


Slide 7. Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar and OMIM evolve rapidly
- New reference data sources
[Charts: evolution in the number of variants that affect patients, (a) for a specific phenotype and (b) across all phenotypes.]
Slide 8. Baseline: blind re-computation
The sparsity issue:
• About 500 executions over 33 patients, with a total runtime of about 60 hours (≈7 minutes per patient on a single-core VM)
• Only 14 relevant output changes detected: 4.2 hours of computation per change
Should we care about database updates?
Slide 9. Whole-exome variant calling: expensive
[Diagram: the GATK best-practices pipeline. Alignment (BWA, Bowtie, Novoalign); Picard MarkDuplicates; GATK quality score recalibration; variant calling (GATK HaplotypeCaller, FreeBayes, SamTools); variant recalibration; Annovar functional annotations (e.g. MAF, synonymy, SNPs, ...), followed by in-house annotations.]
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., ... DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43
Slide 10. Whole-Exome Sequencing pipeline: scale
Data stats per sample:
• 4 files per sample (2-lane, pair-end reads)
• ≈15 GB of compressed text data (gz), ≈40 GB uncompressed text data (FASTQ)
A typical run has 30-40 input samples: 0.45-0.6 TB of compressed data, 1.2-1.6 TB uncompressed. Most steps also use 8-10 GB of reference data. A small 6-sample run takes about 30 hours on the IGM HPC machine (Stages 1+2).
Reference: Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Future Generation Computer Systems, Special Issue: Big Data in the Cloud, 2016.
Slide 11. Workflow Design
Workflows are assembled from "wrapper" blocks around command-line tools plus utility blocks, wired together from output to input ports. A typical wrapper script:

```bash
echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP

echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED

echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
    METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true

echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
    RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
    RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"

echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS
```
Slide 12. Workflow design: conceptual vs. actual
Conceptual: the three-stage pipeline of slide 5 (align/clean/recalibrate/coverage; call, recalibrate and filter variants; annotate).
Actual: 11 workflows, 101 blocks, 28 of them tool blocks.
Slide 13. Parallelism in the pipeline
• Stage I (align, clean, recalibrate, calculate coverage): per-sample parallel processing, for samples 1..n
• Stage II (variant calling and recalibration): per-chromosome parallel processing, after a chromosome split
• Stage III (variant filtering and annotation)
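
The fan-out above can be illustrated with a local task pool. This is our own sketch with placeholder stage functions, not the e-Science Central implementation, which spreads the same pattern across multiple engine VMs.

```python
# Illustration of the Stage I / Stage II fan-out; stage bodies are placeholders.
from concurrent.futures import ThreadPoolExecutor

def stage1(sample):            # align, clean, recalibrate, coverage (per sample)
    return f"{sample}.bam"

def stage2(bam, chromosome):   # variant calling + recalibration (per chromosome)
    return f"{bam}.chr{chromosome}.vcf"

samples = ["S1", "S2", "S3"]
with ThreadPoolExecutor() as pool:
    bams = list(pool.map(stage1, samples))          # Stage I: per-sample parallelism
    futures = [pool.submit(stage2, bam, c)          # Stage II: per-chromosome
               for bam in bams for c in range(1, 23)]
    vcfs = [f.result() for f in futures]
print(len(vcfs))  # 3 samples x 22 autosomes = 66 calling tasks
```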
Slide 15. Performance
Configuration for the 3-VM experiments: Azure workflow engines on D13 VMs, each with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, running Ubuntu 14.04.
[Chart: response time (hh:mm) against number of samples (0-24) for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores).]
Slide 17. Whole-exome variant calling: unstable
Any of the pipeline stages (same GATK best-practices pipeline and citation as on slide 9) may change over time, semi-independently. For example, dbSNP builds:

Build   Released
150     Feb 2017
149     Nov 2016
148     Jun 2016
147     Apr 2016
Slide 18. Comparing three versions of FreeBayes
Should we care about changes in the pipeline?
• Tested three versions of the caller: 0.9.10 (Dec 2013), 1.0.2 (Dec 2015) and 1.1 (Nov 2016)
• Inputs: 16 patient BAM files (7 AD, 9 FTD-ALS); variants filtered at Phred quality score > 30
[Venn diagram: quantitative comparison (% and number) of the filtered variants produced by the three versions.]
Slide 20. Impact on SVI classification
Patient phenotypes: 7 Alzheimer's disease (AD), 9 FTD-ALS. The ONLY change in the pipeline is the version of FreeBayes used to call variants. (R)ed = confirmed pathogenicity; (A)mber = uncertain pathogenicity.

Patient ID   Phenotype   v0.9.10   v1.0.2   v1.1
B_0190       ALS-FTD     A         A        A
B_0191       ALS-FTD     A         A        A
B_0192       ALS-FTD     R         R        R
B_0193       ALS-FTD     A         A        A
B_0195       ALS-FTD     R         R        R
B_0196       ALS-FTD     R         R        R
B_0198       AD          R         A        A
B_0199       ALS-FTD     R         A        A
B_0201       AD          R         R        R
B_0202       AD          A         A        A
B_0203       AD          R         R        R
B_0208       AD          R         A        A
B_0209       AD          R         R        R
B_0211       ALS-FTD     R         A        A
B_0213       ALS-FTD     A         A        A
B_0214       AD          R         R        R
Slide 21. Changes: frequency / impact / cost
[Chart: change frequency (high to low) against change impact on a cohort (low to high), placing GATK, variant annotations (Annovar), the reference human genome, variant DBs (e.g. ClinVar), phenotype → disease mappings (e.g. OMIM GeneMap), new sequences (the "N+1 problem") and the variant caller, grouped into variant calling vs. variant interpretation.]
Slide 22. Changes: frequency / impact / cost (continued)
[The same chart as on slide 21, with part of it marked out as the "ReComp space".]
Slide 23. Understanding change
[Diagram: the versioned analytics machine of slide 3.]
• Threats: will any of the changes invalidate prior findings?
• Opportunities: can the findings be improved over time?
Challenge space = expensive analysis + frequent changes + high impact.
Case studies:
- Bioinformatics pipelines: long running time per case x thousands of cases
- (Long-running simulations: modelling flood events with terrain changes)
  • 21. When should we repeat an expensive simulation? The CityCat flood simulator: processing the Newcastle area for an extreme rainfall event takes 5 hours, and new buildings may alter the water flow. Can we predict high-difference areas?
  • 22. Talk Outline • The importance of quantifying changes to meta-knowledge, and their impact • ReComp: selective re-computation to refresh outcomes in reaction to change • Techniques and initial findings • Open challenges. Project structure: 3 years' funding (Feb. 2016 – Jan. 2019), in collaboration with Cambridge University (Prof. Chinnery, Department of Clinical Neurosciences), the Institute of Genetic Medicine, Newcastle University, and the School of GeoSciences, Newcastle University.
  • 23. The ReComp meta-process. [Diagram: a loop around process P: detect and measure changes (change events, data diff(.,.) functions), estimate the impact of changes, select and enact re-computations, record execution history (History DB), observe executions.] Approach: 1. Quantify data-diff and the impact of changes on prior outcomes. 2. Collect and exploit process history metadata: capture the history of past computations (process structure and dependencies, cost, provenance of the outcomes), then learn from that history via metadata analytics (estimation models for impact, cost, benefits). Changes: • algorithms and tools • accuracy of input sequences • reference databases (HGMD, ClinVar, OMIM GeneMap…)
  • 25. Compute difference sets – ClinVar. The ClinVar dataset: 30 columns. Changes between versions: records 349,074 → 543,841; added 200,746; removed 5,979; updated 27,662.
  • 26. For tabular data, difference is just Select-Project. Key columns: {"#AlleleID", "Assembly", "Chromosome"}; "where" columns: {"ClinicalSignificance"}. Records are matched across versions on the key columns, and only changes in the "where" columns are reported.
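  To make the Select-Project idea concrete, here is a minimal sketch (not the project's actual implementation) of such a diff over two versions of a table loaded as pandas DataFrames; the KEY and WHERE column lists mirror the slide, everything else is illustrative:

    import pandas as pd

    KEY = ["#AlleleID", "Assembly", "Chromosome"]   # identify a record across versions
    WHERE = ["ClinicalSignificance"]                # report changes in these columns only

    def table_diff(old: pd.DataFrame, new: pd.DataFrame):
        # Project onto key + "where" columns, then outer-join on the key (Select-Project).
        merged = old[KEY + WHERE].merge(new[KEY + WHERE], on=KEY, how="outer",
                                        suffixes=("_old", "_new"), indicator=True)
        added = merged[merged["_merge"] == "right_only"]
        removed = merged[merged["_merge"] == "left_only"]
        both = merged[merged["_merge"] == "both"]
        # A record is 'updated' when any of its "where" columns changed value.
        changed = (both[[c + "_old" for c in WHERE]].to_numpy()
                   != both[[c + "_new" for c in WHERE]].to_numpy()).any(axis=1)
        return added, removed, both[changed]

  Changes outside the "where" columns (e.g. LastEvaluated) are deliberately not reported, which keeps the difference sets small.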
  • 27. History DB: workflow provenance. Each invocation of an eSC workflow generates a provenance trace conforming to the ProvONE model (http://vcvcomputing.com/provone/provone.html). [Figure: ProvONE UML diagram: Users, Executions, Programs/Workflows, Ports, Channels and Entities/Collections, linked by relations such as wasAssociatedWith, used, wasGeneratedBy, wasDerivedFrom, wasPartOf, hadPlan.] [Diagram: the "plan" / "plan execution" pattern: workflow WF contains blocks B1 and B2; executions B1exec and B2exec are part of WFexec and associated with their blocks; B1exec generates Data, which B2exec uses, and B2exec also uses db.]
  • 28. Approach – a combination of techniques. 1. Partial re-execution • Identify and re-enact only those portions of a process that are affected by the change. 2. Differential execution • The input to the new execution consists of the differences between two versions of a changed dataset • Only feasible if certain algebraic properties of the process hold. 3. Identifying the scope of change – loss-less • Exclude instances of the population that are certainly not affected.
  • 29. Approach – a combination of techniques. 1. Partial re-execution 2. Differential execution 3. Identifying the scope of change – loss-less
  • 30. SVI as an eScience Central workflow. [Workflow diagram: inputs Phenotype and Patient variants; blocks Phenotype-to-genes (using GeneMap), Variant selection, and Variant classification (using ClinVar); output: Classified variants.]
  • 31. 1. Partial re-execution. 1. Change detection: a provenance fact signals that a new version Dnew of database db is available, e.g. wasDerivedFrom("db", Dnew), with db = "ClinVar v.x". 2. Reacting to the change: 2.1 Find the entry point(s) into the workflow where db was used, by querying for the provenance pattern execution(WFexec), wasPartOf(Xexec, WFexec), used(Xexec, "db"). 2.2 Discover the rest of the sub-workflow graph by following generation/usage dependencies recursively, matching patterns of the form execution(WFexec), execution(B1exec), execution(B2exec), wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec), wasGeneratedBy(Data, B1exec), used(B2exec, Data). [Diagram: the "plan" / "plan execution" pattern as on slide 27.]
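  The recursive discovery in step 2.2 amounts to a graph traversal over used/wasGeneratedBy edges. A minimal sketch, assuming the provenance trace has already been loaded into two sets of pairs (the representation and function name are our own, not ReComp's API):

    from collections import deque

    def affected_executions(used, generated, changed_entity):
        # used:      set of (execution, entity) pairs, "execution used entity"
        # generated: set of (entity, execution) pairs, "entity wasGeneratedBy execution"
        # Returns every execution transitively downstream of the changed entity.
        affected, frontier = set(), deque([changed_entity])
        while frontier:
            entity = frontier.popleft()
            for ex, ent in used:
                if ent == entity and ex not in affected:
                    affected.add(ex)                       # ex consumed tainted data
                    frontier.extend(out for out, producer in generated
                                    if producer == ex)     # its outputs become tainted
        return affected

  Seeding the traversal with "db" yields exactly the entry points of step 2.1 together with their downstream closure.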
  • 32. Minimal sub-graphs in SVI, for a change in ClinVar vs. a change in GeneMap. Overhead: caching the intermediate data required for partial re-execution takes 156 MB for GeneMap changes and 37 kB for ClinVar changes. Time savings:

              Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
    GeneMap   325                          455                           28.5
    ClinVar   287                          455                           37
  • 33. Approach – a combination of techniques. 1. Partial re-execution 2. Differential execution 3. Identifying the scope of change – loss-less
  • 35. P2: Differential execution. Suppose D is a relation (a table). Its diff can then be expressed as a pair of difference sets, $\delta^+ = D^{t'} \setminus D^{t}$ (added records) and $\delta^- = D^{t} \setminus D^{t'}$ (removed records), so that $\mathit{diff}_D(D^t, D^{t'}) = \delta^- \cup \delta^+$. Given the prior outcome $y^t = \mathit{exec}(P, x, D^t)$, we compute the new outcome $y^{t'} = \mathit{exec}(P, x, D^{t'})$ as the combination of $y^t$ with the partial results $y^{t'}_+ = \mathit{exec}(P, x, \delta^+)$ and $y^{t'}_- = \mathit{exec}(P, x, \delta^-)$. This is effective if the difference sets are much smaller than $D^{t'}$, and it can be achieved provided P satisfies the required algebraic properties, e.g. it distributes over set union and difference.
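  As a toy illustration of why distributivity matters, suppose P is a simple selection (selections distribute over union and difference). The new outcome can then be assembled from the old one plus the two difference sets; the function names and data below are ours, chosen for illustration only:

    def select_pathogenic(rows):
        # Toy stand-in for P: a selection, which distributes over union and difference.
        return {r for r in rows if r[1] == "pathogenic"}

    def differential_exec(y_old, delta_plus, delta_minus):
        # y^{t'} = (y^t \ P(delta^-)) u P(delta^+): recompute P only on the diffs.
        return (y_old - select_pathogenic(delta_minus)) | select_pathogenic(delta_plus)

    old = {("v1", "pathogenic"), ("v2", "benign"), ("v3", "pathogenic")}
    y_old = select_pathogenic(old)                # full run at time t
    delta_minus = {("v3", "pathogenic")}          # removed in the new version
    delta_plus = {("v4", "pathogenic")}           # added in the new version
    assert differential_exec(y_old, delta_plus, delta_minus) == \
           select_pathogenic((old - delta_minus) | delta_plus)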
  • 36. P2: Partial re-computation using input difference. Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2)). This works for SVI but is hard to generalise: it depends on the type of process. The bigger gain: diff(CV1, CV2) is much smaller than CV2.

    GeneMap versions (from → to)   ToVersion record count   Difference record count   Reduction
    16-03-08 → 16-06-07            15,910                   1,458                     91%
    16-03-08 → 16-04-28            15,871                   1,386                     91%
    16-04-28 → 16-06-01            15,897                   78                        99.5%
    16-06-01 → 16-06-02            15,897                   2                         99.99%
    16-06-02 → 16-06-07            15,910                   33                        99.8%

    ClinVar versions (from → to)   ToVersion record count   Difference record count   Reduction
    15-02 → 16-05                  290,815                  38,216                    87%
    15-02 → 16-02                  285,042                  35,550                    88%
    16-02 → 16-05                  290,815                  3,322                     98.9%
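  Tying the two sketches together, a hypothetical driver for this idea would feed the diff, rather than the full table, to SVI's query step; run_svi_query is an invented placeholder for Q(.), and table_diff is the earlier sketch:

    import pandas as pd

    def svi_on_diff(clinvar_v1, clinvar_v2, run_svi_query):
        added, removed, updated = table_diff(clinvar_v1, clinvar_v2)
        delta = pd.concat([added, updated])   # records that can change SVI's output
        # removed records would feed the subtraction path (delta^-) of slide 35
        return run_svi_query(delta)           # Q(diff(CV1, CV2)) instead of Q(CV2)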
  • 37. Approach – a combination of techniques. 1. Partial re-execution 2. Differential execution 3. Identifying the scope of change – loss-less
  • 38. 3: Precisely identify the scope of a change, via a patient / DB-version impact matrix. Strong scope (fine-grained provenance): $v \in (\delta^- \cup \delta^+) \wedge \mathit{used}(p_j, v) \Rightarrow p_j \text{ in scope}$. Weak scope (coarse-grained provenance – next slide): "if CVi was used in the processing of pj then pj is in scope". Semantic scope (domain-specific scoping rules), e.g. $v.\mathit{phenotype} = p_j.\mathit{phenotype} \Rightarrow p_j \text{ in scope}$.
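  The three scopes are simply predicates of decreasing precision over patient cases. A sketch, with record structures of our own choosing (dicts keyed by case, plain string IDs):

    def strong_scope(case, delta, used_variants):
        # Fine-grained provenance: the case actually used a changed variant.
        return any(v in delta for v in used_variants[case])

    def weak_scope(case, db_version, used_dbs):
        # Coarse-grained provenance: the case was processed with the changed DB version.
        return db_version in used_dbs[case]

    def semantic_scope(case, variant):
        # Domain rule: a changed variant can only affect cases sharing its phenotype.
        return case["phenotype"] == variant["phenotype"]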
  • 39. A weak scoping algorithm (coarse-grained provenance). Candidate invocation: any invocation I of P whose provenance contains statements of the form used(A, "db"), wasPartOf(A, I), wasAssociatedWith(I, _, WF). Sketch of the algorithm (see also the code below): - For each candidate invocation I of P: - partially re-execute using the difference sets as inputs # see (2) - find the minimal subgraph P' of P that needs re-computation # see (1) - repeat: execute P' one step at a time until <empty output> or <P' completed> - If <P' completed> and not <empty output> then - Execute P' on the full inputs. [Diagram: the "plan" / "plan execution" pattern as on slide 27.]
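  A minimal executable rendering of that sketch; every helper (minimal_subgraph, run_step, run_full) is a hypothetical hook onto the workflow engine, not part of ReComp's published interface:

    def weak_scope_recompute(candidate_invocations, diff_sets,
                             minimal_subgraph, run_step, run_full):
        for inv in candidate_invocations:
            steps = minimal_subgraph(inv)   # P': the minimal subgraph to re-run
            data = diff_sets                # start from the difference sets
            for step in steps:
                data = run_step(step, data)     # one step at a time, on diffs
                if not data:                    # empty output: the change cannot
                    break                       # propagate, so inv is out of scope
            else:
                run_full(inv)   # P' completed with non-empty output:
                                # re-execute on the full inputs

  Per editor's note 24 below, the step-at-a-time loop is only sound while distributivity holds for the tasks in P'; otherwise one must fall back to the full inputs at the first non-distributive task that produces a non-empty output.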
  • 40. Scoping: precision. The approach avoids the majority of re-computations for a given ClinVar change: the number of complete re-executions drops from 495 to 71.
  • 41. ReComp challenges. - Reproducibility: virtualisation. - Sensitivity analysis is unlikely to work well: small input perturbations can have a potentially large impact on a diagnosis. - Learning useful estimators is hard. - Diff functions are both type- and application-specific (specific → generic). - Not all runtime environments support provenance recording. [Diagram: the ReComp loop as on slide 23: change events, data diff(.,.) functions, History DB, observed executions of process P.]
  • 43. The Metadata Analytics challenge: learning from a metadata DB of execution history to support automated ReComp decisions.
  • 44. Changes, data diff, impact. 1) Observed change events concern inputs, dependencies, or both: $C = \{D^{t} \rightarrow D^{t'},\ X^{t} \rightarrow X^{t'}\}$. 2) Type-specific diff functions quantify each change: $\mathit{diff}_X(X^t, X^{t'})$, $\mathit{diff}_Y(Y^t, Y^{t'})$, $\mathit{diff}_D(D^t, D^{t'})$. 3) Impact occurs to various degrees on multiple prior outcomes, and is process- and data-specific. With $Y^t = \mathit{exec}(P, X, D^t)$ and $Y^{t'} = \mathit{exec}(P, X, D^{t'})$, the impact of change C on the processing of a specific X is $\mathit{imp}_P(C, X) = f_P(\mathit{diff}_Y(Y^t, Y^{t'}))$; for SVI, $\mathit{imp}_{SVI}(C, X) \in \{\texttt{None}, \texttt{Low}, \texttt{High}\}$.
  • 45. Impact: importance and scope. Scope: which cases are affected? Individual variants have an associated phenotype, and patient cases also have a phenotype, which yields the scoping rule: "a change in variant v can only have impact on a case X if v and X share the same phenotype". Importance: "any variant with status moving from/to Red causes High impact on any X that is affected by the variant".
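  Read together, the two rules suggest an f_SVI of roughly this shape; treating every non-Red change as Low is our simplifying assumption, not a rule stated on the slide:

    def imp_svi(diff_y):
        # diff_y: variant-level output changes, each with old/new SVI status.
        if not diff_y:
            return "None"
        if any(v["old"] == "red" or v["new"] == "red" for v in diff_y):
            return "High"               # a variant moved from/to Red
        return "Low"                    # assumption: all other changes are Low

    def in_scope(variant, case):
        # Scoping rule: phenotypes must match for the change to matter.
        return variant["phenotype"] == case["phenotype"]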
  • 46. History database. HDB: a metadata database containing records of past executions, $er = \langle P, X^{t}, D^{t}, Y^{t}, c^{t}, T \rangle$, with $HDB = \{er_1, er_2, \dots, er_N\}$ and processed population ${\cal X} = \{er.X \mid er \in HDB\}$. Example, considering only one type of change (the variant caller): [Diagram: inputs X1…X5 processed under successive caller configurations (GATK Haplotype Caller, FreeBayes 0.9, 1.0, 1.1) with change events C1, C2, C3 between them, yielding outcomes Y11…Y53; only some inputs were re-processed after each change.]
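  A sketch of what one HDB entry might look like as a record type; the field names are illustrative, but the tuple structure follows the slide:

    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class ExecRecord:
        # er = <P, X^t, D^t, Y^t, c^t, T>
        process: str        # P, e.g. "SVI"
        x: str              # input (e.g. a patient exome) at time t
        deps: dict          # dependency versions, e.g. {"ClinVar": "16-05"}
        outcome: Any        # Y^t, e.g. the classified variants
        cost: float         # c^t, e.g. runtime in seconds
        timestamp: str      # T

    HDB: list[ExecRecord] = []   # the history database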
  • 47. ReComp decisions. Given a population ${\cal X}$ of processed inputs and a change $C$, ReComp must learn to make a yes/no decision for each $X \in {\cal X}$: $\mathit{recomp}_P(C, X)$ returns True if P is to be executed again on X, and False otherwise. To decide, ReComp must estimate the impact $\widehat{\mathit{imp}}_P(C, X)$ (as well as the re-computation cost $\widehat{\mathit{cost}}(C, X)$). Example: $\mathit{recomp}_{SVI}(C, X)$ = True if $\widehat{\mathit{imp}}_{SVI}(C, X) \neq \texttt{None}$, False otherwise. Objective: maximise the reward $\mathit{reward}(\mathit{recomp}_P(C, X), \mathit{imp}_P(C, X))$.
  • 48. History DB and Differences DB. Whenever P is re-computed on input X, a new execution record $er' = \langle X^{t'}, D^{t'}, Y^{t'}, c^{t'} \rangle$ is added to HDB for X, alongside $er = \langle X^{t}, D^{t}, Y^{t}, c^{t} \rangle$. Using diff() functions we produce a derived difference record $dr = \langle \mathit{diff}_X(X^{t}, X^{t'}), \mathit{diff}_D(D^{t}, D^{t'}), \mathit{diff}_Y(Y^{t}, Y^{t'}), \mathit{imp}(C, X) \rangle$, collected in a Differences database $DDB = \{dr_1, dr_2, \dots, dr_M\}$. Example: dr1 = Imp(C1,X1), dr2 = Imp(C12,X4), dr3 = Imp(C1,X5), dr4 = Imp(C2,X5). [Diagram: HDB with inputs X1…X5 and caller configurations as on slide 46, with the re-computed cells feeding DDB.]
  • 50. Learning challenges. • Evidence is small and sparse: how can it be used for selecting from ${\cal X}$? • Learning a reliable imp() function is not feasible. • What's the use of history? You never see the same change twice! We must somehow use evidence from related changes. • A possible approach: ReComp makes probabilistic decisions and takes chances; associate a reward with each ReComp decision → reinforcement learning; Bayesian inference (use new evidence to update probabilities). [Diagram: HDB and DDB as on slide 46.]
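  To make the reward-driven idea concrete, here is a toy epsilon-greedy sketch. The reward matrix comes from editor's note 29 below; the Laplace-smoothed impact estimate and the exploration rate are our own illustrative choices, not the project's method:

    import random

    REWARD = {("None", True): -10, ("None", False): +20,
              ("Low",  True):  +1, ("Low",  False):  -1,
              ("High", True):  +2, ("High", False): -100}

    class RecompAgent:
        def __init__(self, eps=0.1):
            self.counts = {"None": 1, "Low": 1, "High": 1}  # Laplace prior
            self.eps = eps

        def p(self, level):
            # Current belief that the next change has this impact level.
            return self.counts[level] / sum(self.counts.values())

        def expected_reward(self, decision):
            return sum(self.p(l) * REWARD[(l, decision)] for l in self.counts)

        def decide(self):
            if random.random() < self.eps:           # explore: take a chance
                return random.choice([True, False])
            return self.expected_reward(True) >= self.expected_reward(False)

        def observe(self, impact_level):
            self.counts[impact_level] += 1            # Bayesian-style count update

  The exploration step is the "takes chances" element: occasionally recomputing even when the expected reward says skip, so that new impact evidence keeps arriving.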

Editor's Notes

  1. Genomics is a form of data-intensive / computation-intensive analysis
  2. Changes in the reference databases have an impact on the classification
  3. returns updates in mappings to genes that have changed between the two versions (including possibly new mappings): $\diffOM(\OM^t, \OM^{t'}) = \{\langle t, genes(\dt) \rangle | genes(\dt) \neq genes'(\dt) \} $\\ where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$. \begin{align*} \diffCV&(\CV^t, \CV^{t'}) = \\ &\{ \langle v, \varst(v) \rangle \mid \varst(v) \neq \varst'(v) \} \\ & \cup \CV^{t'} \setminus \CV^t \cup \CV^t \setminus \CV^{t'} \label{eq:diff-cv} \end{align*} where $\varst'(v)$ is the new class associated to $v$ in $\CV^{t'}$.
  4. Point of slide: sparsity of impact demands better than blind recomp. Table 1 summarises the results. We recorded four types of outcomes: firstly, confirming the current diagnosis, which happens when additional variants are added to the Red class; secondly, retracting the diagnosis, which may happen (rarely) when all red variants are retracted; thirdly, changes in the amber class which do not alter the diagnosis; and finally, no change at all. The table reports results from nearly 500 executions, concerning a cohort of 33 patients, for a total runtime of about 58.7 hours. As merely 14 relevant output changes were detected, this is about 4.2 hours of computation per change: a steep cost, considering that an actual execution of SVI takes a little over 7 minutes.
  5. Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store). These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely HG19 from UCSC).
  6. A Modular architecture
  7. Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  8. 3 workflow engines perform better than our HPC benchmark on larger sample sizes
  9. our recommendation is the use of BWA-MEM and Samtools pipeline for SNP calls and BWA-MEM and GATK-HC pipeline for indel calls. 
  10. In four cases a change in the caller version changes the classification.
  11. Changes can be frequent or rare, disruptive or marginal
  12. (Same note as 11.)
  13. $y^t = \mathit{exec}(P,x,D^t)$; $y^{t'}_+ = \mathit{exec}(P,x,\delta^+)$
  14. This is only a small selection of rows and a subset of columns. In total there were 30 columns, 349,074 rows in the old set and 543,841 rows in the new set, with 200,746 added rows, 5,979 removed rows and 27,662 changed rows. As on the previous slide, you may want to highlight that the selection of key columns and where-columns is very important. For example, using #AlleleID, Assembly and Chromosome as the key columns, we have entry #AlleleID 15091 which looks very similar in both the added (green) and removed (red) sets; they differ, however, in the Chromosome column. Considering the where-columns, using only ClinicalSignificance returns blue rows which differ between versions only in that column. Changes in other columns (e.g. LastEvaluated) are not reported, which may have ramifications if such a difference is used to produce the new output.
  15. (Same note as 14.)
  16. Firstly, if we can analyse the structure and semantics of process P, to recompute an instance of P more effectively we may be able to reduce re-computation to only those parts of the process that are actually involved in the processing of the changed data. For this, we are inspired by techniques for smart rerun of workflow-based applications [6, 7], as well as by more general approaches to incremental computation [8, 9].
  17. (Same note as 16.)
  18. Experimental setup for our study of ReComp techniques: the SVI workflow with automated provenance recording; a cohort of about 100 exomes (neurological disorders); changes in ClinVar and OMIM GeneMap.
  19. (Same note as 16.)
  20. $y^t = \mathit{exec}(P,x,D^t)$; $y^{t'}_+ = \mathit{exec}(P,x,\delta^+)$; $\delta^- \cup \delta^+$
  21. Also, as in Tab. 2 and 3 in the paper, I'd mention whether this reduction was possible with a generic diff function or a specific function tailored to SVI. What is also interesting, and what I would highlight, is that even if the reduction is very close to 100% but below it, the cost of recomputation of the process may still be significant because of some constant-time overheads related to running a process (e.g. loading data into memory). e-SC workflows suffer from exactly this issue (every block serializes and deserializes data), and that's why Fig. 6 shows an increase in runtime for GeneMap executed with 2 \deltas even though the reduction is 99.94% (cf. Tab. 2 and Fig. 6 for the GeneMap diff between 16-10-30 –> 16-10-31).
  22. (Same note as 16.)
  23. $v \in (\delta^- \cup \delta^+) \wedge \mathit{used}(p_j, v) \Rightarrow p_j \text{ in scope}$; $v.\mathit{phenotype} = p_j.\mathit{phenotype} \Rightarrow p_j \text{ in scope}$
  24. Regarding the algorithm, you show the simplified version (Alg. 1). But please also take a look at Alg. 2 and mention that you can only run the loop if distributivity holds for all P in the downstream graph. Otherwise, you need to break and re-execute on the full inputs just after the first non-distributive task produces a non-empty output. But, obviously, the hope is that with a well-tailored diff function the output will be empty for the majority of cases.
  25. \diff{X}(X^t, X^{t'}), \quad \diff{Y}(Y^t, Y^{t'}), \quad \diff{D}(D^t,D^{t'})  C = \{\update{D^{t'}}{D^t},  \quad  \update{X^{t'}}{X^t} \} Y^t &= \exec(P, X, D^t) \\ Y^{t'} &= \exec(P, X, D^{t'}) \\ \impact_{P}(C,X) &= f_{P}( \diff{Y}(Y^t, Y^{t'})) \imp{C,X}_{SVI} = f_{SVI}( \diff{Y}(Y^t, Y^{t'})) \in \{ \texttt{None}, \texttt{Low}, \texttt{High} \}
  26. \text{let } v \in \diff{Y}(Y^t, Y^{t'}): \\ \text{for any $X$: } \impact_{P}(C,X) = \texttt{High} \text{ if }\\ v.\texttt{status:} \begin{cases} * \rightarrow \texttt{red} \\ \texttt{red} \rightarrow *  \end{cases}
  27. er = \langle P, X^{t}, D^{t}, Y^{t}, c^{t}, T \rangle   \HDB = \{  er_1, er_2 \dots er_N \} {\cal X} = \{ er.X | er \in \HDB\}
  28. C = \update{D^{t'}}{D^t}, \update{X^{t'}}{X^t} X \in {\cal X} \imphat_P(C,X) \costhat(C,X) \recomp_P(C,X) \impact_{P}(C,X) \langle X^{t}, D^{t}, Y^{t}, c^{t} \rangle \qquad \langle X^{t'}, D^{t'}, Y^{t'}, c^{t'} \rangle  \recomp_{SVI}(C,X) = \begin{cases} \text{True} & \text{if } \imphat_{SVI}(C,X) \neq \texttt{None} \\ \text{False} & \text{otherwise} \end{cases}
  29. \begin{tabular}{|c|c|c|} \hline  \rule[-1ex]{0pt}{2.5ex}  & \multicolumn{2}{c|}{\textbf{Recompute?}} \\  \hline  \rule[-1ex]{0pt}{2.5ex} impact & yes & no \\  \hline  \rule[-1ex]{0pt}{2.5ex} None & -10 & +20 \\  \hline  \rule[-1ex]{0pt}{2.5ex} Low & +1 & -1 \\  \hline  \rule[-1ex]{0pt}{2.5ex} High & +2 & -100 \\  \hline  \end{tabular}  \reward(\recomp_P(C,X),   \impact_P(C,X))
  30. er = \langle X^{t}, D^{t}, Y^{t}, c^{t} \rangle \qquad er' = \langle X^{t'}, D^{t'}, Y^{t'}, c^{t'} \rangle  \dr = \langle \diff_X(X^{t}, X^{t'}), \diff_D(D^{t}, D^{t'}), \diff_Y(Y^{t}, Y^{t'}), \imp{C,X} \rangle \DDB = \{ \dr_1, \dr_2 \dots \dr_M \}
  31. \begin{algorithm}[H] \SetCustomAlgoRuledWidth{\textwidth}  \KwData{Evidence $E = \{ \mathit{HDB}, \mathit{DDB} \}$, Population ${\cal X}$, change $C$}   \KwResult{Updated outcomes for a subset ${\cal X}' \subseteq {\cal X}$, updated Evidence}   $\mathit{dv} = \mathbf{1}$\;  \While{$\mathit{dv} \neq \mathbf{0}$}     {     $\mathit{dv} = \mathit{select}(E,C)$ \tcc{ binary \textit{decision vector} of size $|{\cal X}|$}     $[Y_i^{t'}]_{i:1 \dots k} = \mathit{execAll}(dv, {\cal X})$ \tcc{Re-comp all $k$ selected $X \in {\cal X}$}     $I = [ \imp{Y_i^{t}, Y_i^{t'}}]_{i:1 \dots k}$ \tcc{calculate impact from the new outcomes}     $E = \mathit{updateEvidence}(E,I)$ \tcc{update evidence adding new impact}     }        \end{algorithm}