TAPP’16
P.Missier,2016
The data, they are a-changin’
(ReComp: Your Data Will Not Stay Smart Forever)
Paolo Missier, Jacek Cala, Eldarina Wijaya
School of Computing Science,
Newcastle University
{firstname.lastname}@ncl.ac.uk
TAPP’16
McLean, VA, USA
June, 2016
Panta Rhei
(Heraclitus, through Plato)
[Image: painting by Johannes Moreelse]
Data to Knowledge
[Diagram: Lots of Data; Big Analytics Machine; “Valuable Knowledge”. Meta-knowledge: algorithms, tools, middleware, reference datasets.]
The missing element: time
[Diagram: Lots of Data; Big Analytics Machine; “Valuable Knowledge” in versions V1, V2, V3. Meta-knowledge: algorithms, tools, middleware, reference datasets. Three elements are annotated with time t.]
Your Data Will Not Stay Smart Forever
ReComp
Observe change
• In input data
• In meta-knowledge
Assess and measure
• Knowledge decay
Estimate
• Cost and benefits of refresh
Enact
• Reproduce (analytics) processes
The ReComp decision support system
[Diagram: the ReComp Decision Support System. Phases: Observe change; Assess and measure; Estimate; Enact. Labelled elements: change events; Diff(.,.) functions; utility functions; impact estimation; cost estimates; reproducibility assessment; history of knowledge assets and their metadata. Output: re-computation recommendations.]
ReComp concerns
1. Observability (transparency)
How much can we observe?
• Structure
• Data flow
2. Change detection: inputs, outputs, external resources
Can we quantify the extent of changes? → diff() functions
3. Impact assessment
Can we quantify knowledge decay?
4. Control: reaction to changes
How much re-computation control do we have on the system?
• Scope: which instances?
• Frequency: how often?
• Re-run extent: how much?
Provenance
Reproducibility
- Virtualisation
- Smart re-run
Observability / transparency
White box vs. black box

Structure (static view)
  White box: Dataflow (eScience Central, Taverna, VisTrails…); Scripting (R, Matlab, Python...)
  Black box: Packaged components; Third party services
Data dependencies (runtime view)
  White box: Provenance recording: inputs, reference datasets, component versions, outputs
  Black box: Input, outputs; no data dependencies; no details on individual components
Cost
  White box: Detailed resource monitoring; Cloud → £££
  Black box: Wall clock time; service pricing; setup time (e.g. model learning)

This talk: White box ReComp -- initial experiments
Example: genomics / variant interpretation
SVI is a classifier of likely variant deleteriousness:
y = {(v, class)|v ∈ varset, class ∈ {red, amber, green}}
Classification outcomes: uncertain diagnosis; definitely deleterious; definitely benign
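A minimal sketch of how the classifier output y could be represented and compared across two runs. The traffic-light reading of the classes (red = deleterious, amber = uncertain, green = benign), the variant ids, and all function names are assumptions for illustration; none of this is taken from the SVI implementation.

```python
from enum import Enum


class VariantClass(Enum):
    # Assumed traffic-light reading of the three classes.
    RED = "red"      # definitely deleterious
    AMBER = "amber"  # uncertain diagnosis
    GREEN = "green"  # definitely benign


def output_diff(y_old, y_new):
    """Return the variants whose class differs between two outputs.

    Each output maps a variant id to a VariantClass, i.e. the set
    {(v, class)} written as a dict. Variants present in only one
    output are reported with None on the other side.
    """
    changed = {}
    for v in sorted(set(y_old) | set(y_new)):
        before, after = y_old.get(v), y_new.get(v)
        if before != after:
            changed[v] = (before, after)
    return changed


# Hypothetical variant ids, for illustration only.
y_v1 = {"var1": VariantClass.AMBER, "var2": VariantClass.GREEN}
y_v2 = {"var1": VariantClass.RED, "var2": VariantClass.GREEN}
print(output_diff(y_v1, y_v2))
```

An empty result means the change in inputs or reference data had no effect on this patient's classification.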
OMIM and ClinVar changes
Sources of changes:
- Patient variants → improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
ClinVar / OMIM relevant changes over time for a patient cohort
(Newcastle Institute of Genetic Medicine)
White box ReComp
[Diagram: process P with inputs x11, x12; dependencies D11, D12; output y11]
For each run i:
Observables:
Inputs X = {xi1, xi2, …}
Outputs y = {yi1, yi2, …}
Dependencies Di1, Di2, ...
Variable-granularity provenance prov(y)
Granular Cost(y) → single-block level
Granular process structure P → workflow graph
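The observables listed above could be held in a record like the following. This is a sketch under stated assumptions: the class names, field names, block names, version strings and cost units are all hypothetical and are not the ReComp data model.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One workflow block, with its own recorded cost."""
    name: str
    cost: float                                  # e.g. seconds or currency units
    used: set = field(default_factory=set)       # ids of inputs/dependencies read
    upstream: set = field(default_factory=set)   # names of blocks feeding this one


@dataclass
class Run:
    """Observables recorded for a single run i of process P."""
    run_id: str
    inputs: dict         # X = {x_i1, x_i2, ...}: id -> version
    outputs: dict        # y = {y_i1, y_i2, ...}: id -> value
    dependencies: dict   # D_i1, D_i2, ...: id -> version
    blocks: list         # granular structure and provenance of P

    def cost(self):
        """Cost(y), aggregated from single-block costs."""
        return sum(b.cost for b in self.blocks)


run = Run(
    run_id="run-1",
    inputs={"x11": "v1", "x12": "v1"},
    outputs={"y11": "amber"},
    dependencies={"D11": "2016-01", "D12": "2016-01"},
    blocks=[
        Block("annotate", cost=4.0, used={"x11", "D11"}),
        Block("classify", cost=6.0, used={"D12"}, upstream={"annotate"}),
    ],
)
print(run.cost())  # 10.0
```

Keeping `used` and `upstream` per block is what makes the record granular; a coarse record would keep only the run-level `inputs`, `outputs` and `dependencies`.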
White-box provenance
[Diagram: process P with inputs x11, x12; dependencies D11, D12; output y11]
Coarse:
Granular:
A history of runs
Run 1, Patient A: process P; inputs x11, x12; dependencies D11, DCV; output y1
Run 2, Patient B: process P; inputs x21, x22; dependencies D21, DCV; output y2
History database:
ReComp questions
• Scope: Which instances?
Which patients within the cohort are going to be affected by a change in input/reference data?
• Re-run Extent: how much?
Where in each process instance is the reference data used?
• Impact: why bother?
For each patient in scope, how likely is it that the patient’s diagnosis will change?
• Frequency: how often?
How often are updates available for the resources we depend on?
Available Metadata
1. History DB
2. Measurable changes:
Input diff → one patient at a time
Output diff → has the change had any impact?
Dependencies → affects entire cohort → scoping
Example:
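A minimal sketch of a diff over versioned resources, applied first to one patient's inputs and then to the shared dependencies. The function name, resource ids and version strings are hypothetical, and real diff functions would compare contents rather than version labels.

```python
def diff_versions(old, new):
    """Return {resource_id: (old_version, new_version)} for every change."""
    return {
        rid: (old.get(rid), new.get(rid))
        for rid in sorted(set(old) | set(new))
        if old.get(rid) != new.get(rid)
    }


# Input diff: computed for one patient at a time.
inputs_then = {"x11": "call-v1", "x12": "call-v1"}
inputs_now = {"x11": "call-v2", "x12": "call-v1"}
print(diff_versions(inputs_then, inputs_now))

# Dependency diff: computed once, then used to scope the whole cohort.
deps_then = {"ClinVar": "2016-01", "OMIM": "2016-01"}
deps_now = {"ClinVar": "2016-02", "OMIM": "2016-01"}
print(diff_versions(deps_then, deps_now))
```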
Querying provenance to determine Scope and Re-run Extent
Case 1: Granular provenance
Given observed changes in resources Dij and a history instance h(y, v):
1. Scoping: for each changed Dij, h(y, v) is in the scope S ⊆ H if prov(y) shows that some Pj depends on Dij; Pj is added to Pscope(y)
2. Re-run Extent:
  1. Find a partial order on Pscope(y)
  2. Re-run starts from each of the earliest Pj whose output is available as a persistent intermediate result
see for instance Smart Run Manager [1]
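The two steps can be sketched as follows, assuming granular provenance records which resources each block used and which blocks feed it. The block names and the dictionary layout are hypothetical, and this is a plain topological-order walk, not the Smart Run Manager algorithm. It also assumes that outputs of unaffected blocks were persisted and can be reused.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def pscope(blocks, changed):
    """Blocks whose recorded provenance shows use of a changed resource.

    blocks maps a block name to {"used": set of resource ids,
    "upstream": set of block names}; changed is a set of resource ids.
    """
    return {name for name, b in blocks.items() if b["used"] & changed}


def rerun_plan(blocks, changed):
    """Blocks to re-run, in execution order: Pscope plus everything downstream."""
    # TopologicalSorter expects node -> predecessors, which is "upstream" here.
    graph = {name: b["upstream"] for name, b in blocks.items()}
    scope = pscope(blocks, changed)
    plan = []
    for name in TopologicalSorter(graph).static_order():
        if name in scope or blocks[name]["upstream"] & set(plan):
            plan.append(name)
    return plan


# Hypothetical four-block chain; only "filter" reads the changed resource DCV.
blocks = {
    "load": {"used": {"x11"}, "upstream": set()},
    "annotate": {"used": {"D11"}, "upstream": {"load"}},
    "filter": {"used": {"DCV"}, "upstream": {"annotate"}},
    "classify": {"used": set(), "upstream": {"filter"}},
}
print(rerun_plan(blocks, {"DCV"}))  # ['filter', 'classify']
```

Here `load` and `annotate` are skipped, so the saving depends on how much of the run's cost sits upstream of the first affected block.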
Querying provenance to determine Scope and Re-run Extent
Case 2: Coarse-grained provenance
Scoping: Any instance that depends on any Dij is in scope:
Pscope = {Pj}, where:
For each
Re-run Extent:
The mechanism from the fine-grained case still works
This is trivial for a homogeneous run population, but
H may contain run history for many different workflows!
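A sketch of coarse-grained scoping over a history that mixes several workflows. It assumes each history record carries only run-level dependencies and a workflow label; the record layout, workflow names and run ids are hypothetical.

```python
from collections import defaultdict


def scope_coarse(history, changed):
    """Group in-scope run ids by workflow.

    history is a list of dicts with keys "run_id", "workflow" and
    "dependencies" (a set of resource ids recorded for the whole run).
    """
    in_scope = defaultdict(list)
    for h in history:
        if h["dependencies"] & changed:
            in_scope[h["workflow"]].append(h["run_id"])
    return dict(in_scope)


history = [
    {"run_id": "run-1", "workflow": "SVI", "dependencies": {"D11", "DCV"}},
    {"run_id": "run-2", "workflow": "SVI", "dependencies": {"D21", "DCV"}},
    {"run_id": "run-3", "workflow": "other", "dependencies": {"D31"}},
]
print(scope_coarse(history, {"DCV"}))  # {'SVI': ['run-1', 'run-2']}
```

Grouping by workflow matters because the re-run extent then has to be worked out once per workflow structure rather than once for the whole history.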
Assessing impact and cost
Approach: small-scale re-comp over the population in scope
1. Sample instances S’ ⊆ S from the population in scope S
2. Perform partial re-run on each instance h(yi,v) ∈ S’, generating new outputs yi’
3. Compute the output difference diff(yi, yi’)
4. Assess impact (user-defined) and cost(y’)
5. Estimate cost difference diff(cost(y), cost(y’))
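The sampling approach above might look like this in outline. It is a sketch with assumptions: impact is reduced to "fraction of sampled outputs that changed", which stands in for a user-defined impact function, and `rerun` is a caller-supplied placeholder for the partial re-run. All names and numbers are hypothetical.

```python
import random


def estimate_impact(scope, rerun, sample_size, seed=0):
    """Estimate impact and cost change from a sample of in-scope instances.

    scope maps an instance id to (y, cost) taken from the history;
    rerun(instance_id) returns (y_new, cost_new) for a partial re-run.
    """
    rng = random.Random(seed)
    sample = rng.sample(sorted(scope), min(sample_size, len(scope)))  # step 1
    if not sample:
        return {"sampled": 0, "changed_fraction": 0.0, "mean_cost_delta": 0.0}
    changed = 0
    cost_delta = 0.0
    for inst in sample:
        y_old, cost_old = scope[inst]
        y_new, cost_new = rerun(inst)       # step 2: partial re-run
        changed += y_new != y_old           # step 3: output diff
        cost_delta += cost_new - cost_old   # step 5: cost difference
    n = len(sample)
    return {
        "sampled": n,
        "changed_fraction": changed / n,    # step 4: stand-in impact measure
        "mean_cost_delta": cost_delta / n,
    }


# Hypothetical cohort of four patients; the fake re-run changes only pA.
scope = {p: ("amber", 10.0) for p in ("pA", "pB", "pC", "pD")}


def fake_rerun(inst):
    return ("red" if inst == "pA" else "amber", 4.0)


print(estimate_impact(scope, fake_rerun, sample_size=4))
```

With a sample smaller than the scope, the result is an estimate whose reliability depends on how representative the sampled instances are.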
ReComp user dashboard and architecture
[Architecture diagram. ReComp decision dashboard: Execute; Curate; Select/prioritise. Components: prospective provenance curation (Yworkflow); Meta-Knowledge Repository; Research Objects; Change Impact Analysis; Cost Estimation; Differential Analysis; Reproducibility Assessment. Domain knowledge: utility functions; priority policies; data similarity functions. Runtime monitors (logging, runtime provenance recorder) for Python, WP1, and other analytics environments supply provenance, logs, data and process versions, and process dependencies.]
ReComp is a Decision Support System
Impact, cost assessment → ReComp user dashboard
Current status and Challenges
Implementation in progress
Small-scale experiments on scoping / partial re-run
- Test cohort of about 50 (real) patients
- Short workflow runs (about 15 mins), observable cost savings
- (preliminary results)
Main challenge: deliver a generic and reusable DSS
From eScience Central → To generic dataflow, scripting (Python)
From:
eSc prov traces → PROV-compliant but idiosyncratic patterns
Python → noWorkflow traces
To: Canonical PROV patterns + queries + H DB implementation
ReComp: http://recomp.org.uk/
References
[1] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones,
E. Lee, J. Tao, and Y. Zhao. Scientific Workflow Management and the Kepler System.
Concurrency and Computation: Practice & Experience, Special Issue on Scientific
Workflows, Wiley, 2005.
[2] R. Ikeda, S. Salihoglu, and J. Widom. Provenance-Based Refresh
in Data-Oriented Workflows. In Procs. CIKM, 2011.
[3] R. Ikeda and J. Widom. Panda: A system for provenance and data. Procs.
TaPP’10, 33:1–8, 2010.
[4] D. Koop, E. Santos, B. Bauer, M. Troyer, J. Freire, and C. T. Silva. Bridging
workflow and data provenance using strong links. In Scientific and statistical
database management, pages 397–415. Springer, 2010. ISBN 3642138179.
[5] P. Missier, E. Wijaya, R. Kirby, and M. Keogh. SVI: a simple single-nucleotide
Human Variant Interpretation tool for Clinical Use. In Procs. 11th International
conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015.
Springer.

Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
 
algorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyalgorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparency
 

Recently uploaded

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 

Recently uploaded (20)

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

The data, they are a-changin’

  • 1. TAPP’16 P.Missier,2016 The data, they are a-changin’ (ReComp: Your Data Will Not Stay Smart Forever) Paolo Missier, Jacek Cala, Eldarina Wijaya School of Computing Science, Newcastle University {firstname.lastname}@ncl.ac.uk TAPP’16 McLean, VA, USA June, 2016 (*) Painting by Johannes Moreelse (*) Panta Rhei (Heraclitus, through Plato)
  • 2. TAPP’16 P.Missier,2016 Data to Knowledge Lots of Data Big Analytics Machine “Valuable Knowledge” Meta-knowledge Algorithms Tools Middleware Reference datasets
  • 3. TAPP’16 P.Missier,2016 The missing element: time Lots of Data Big Analytics Machine “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t Your Data Will Not Stay Smart Forever
  • 4. TAPP’16 P.Missier,2016 ReComp Observe change • In input data • In meta-knowledge Assess and measure • knowledge decay Estimate • Cost and benefits of refresh Enact • Reproduce (analytics) processes Lots of Data The Big Analytics Machine “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t
  • 5. TAPP’16 P.Missier,2016 The ReComp decision support system Observe change Assess and measure Estimate Enact Change Events Diff(.,.) functions utility functions Impact estimation Cost estimates Reproducibility assessment ReComp Decision Support System History of Knowledge Assets and their metadata Re-computation recommendations
  • 6. TAPP’16 P.Missier,2016 ReComp concerns 1. Observability (transparency) How much can we observe? • Structure • Data flow 2. Change detection: inputs, outputs, external resources Can we quantify the extent of changes? → diff() functions 4. Control: reaction to changes How much re-computation control do we have on the system? Provenance 3. Impact assessment Can we quantify knowledge decay? Reproducibility - Virtualisation - Smart re-run • Scope: Which instances? • Frequency: how often? • Re-run Extent: how much? Change Events Diff(.,.) functions utility functions Impact estimation Cost estimates Reproducibility assessment ReComp Decision Support System
  • 7. TAPP’16 P.Missier,2016 Observability / transparency White box Black box Structure (static view) Dataflow - eScience Central, Taverna, VisTrails… Scripting: - R, Matlab, Python... - Packaged components - Third party services Data dependencies (runtime view) Provenance recording: • Inputs, • Reference datasets, • Component versions, • Outputs • Input • Outputs • No data dependencies • No details on individual components Cost • Detailed resource monitoring • Cloud → £££ • Wall clock time • Service pricing • Setup time (e.g. model learning) This talk: White box ReComp -- initial experiments
  • 8. TAPP’16 P.Missier,2016 Example: genomics / variant interpretation SVI is a classifier of likely variant deleteriousness: y = {(v, class)|v ∈ varset, class ∈ {red, amber, green}} Uncertain diagnosis Definitely deleterious Definitely benign
  • 9. TAPP’16 P.Missier,2016 OMIM and ClinVar changes Sources of changes: - Patient variants → improved sequencing / variant calling - ClinVar, OMIM evolve rapidly - New reference data sources CLINVAR / OMIM relevant changes over time for a patient cohort (Newcastle Institute of Genetic Medicine)
  • 10. TAPP’16 P.Missier,2016 x11 x12 y11 P D11 D12 White box ReComp For each run i: Observables: Inputs X = {xi1, xi2, …} Outputs y = {yi1, yi2, …} Dependencies D11, D12, ... Variable-granularity provenance prov(y) Granular Cost(y) → single-block level Granular Process structure P → workflow graph
  • 12. TAPP’16 P.Missier,2016 A history of runs x11 x12 y1 P D11 DCV Run 1, Patient A x21 x22 y2 P D21 DCV Run 2, Patient B History database:
  • 13. TAPP’16 P.Missier,2016 ReComp questions • Scope: Which instances? Which patients within the cohort are going to be affected by a change in input/reference data? • Re-run Extent: how much? Where in each process instance is the reference data used? • Impact: why bother? For each patient in scope, how likely is it that the patient’s diagnosis will change? • Frequency: how often? How often are updates available for the resources we depend on? x11 x12 y11 P D11 D12
  • 14. TAPP’16 P.Missier,2016 Available Metadata 1. History DB 2. Measurable changes: Input diff: → one patient at a time Output diff: → has the change had any impact? Dependencies → affects entire cohort → scoping Example:
  • 15. TAPP’16 P.Missier,2016 Querying provenance to determine Scope and Re-run Extent Given observed changes in resources History instance: 1. Scoping: For each Case 1: Granular provenance is in the scope S ⊆ H if Pj is added to Pscope(y) 2. Re-run Extent: 1. Find a partial order on Pscope(y) 2. Re-run starts from each of the earliest Pj such that their output is available as persistent intermediate result see for instance Smart Run Manager [1] [1] Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger-Frank, E., Jones, M., Lee, E., Tao, J., Zhao, Y.: Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, 2005, Wiley.
  • 16. TAPP’16 P.Missier,2016 Querying provenance to determine Scope and Re-run Extent Scoping: Any instance that depends on any Dij is in scope: Pscope = {Pj}, where: For each Case 2: Coarse-grained provenance Re-run Extent: The mechanism from the fine-grained case still works This is trivial for a homogeneous run population, but H may contain run history for many different workflows!
  • 17. TAPP’16 P.Missier,2016 Assessing impact and cost Approach: small-scale re-comp over the population in scope 1. Sample instances S’ ⊆ S from the population in scope S 2. Perform partial re-run on each instance h(yi,v) ∈ S’, generating new outputs yi’ 3. Compute 4. Assess impact (user-defined) and cost(y’) 5. Estimate cost difference diff(cost(y), cost(y’))
  • 18. TAPP’16 P.Missier,2016 ReComp user dashboard and architecture ReComp decision dashboard Execute Curate Select/ prioritise prospective provenance curation (Yworkflow) Meta-Knowledge Repository Research Objects Change Impact Analysis Cost Estimation Differential Analysis Reproducibility Assessment - Utility functions - Priorities policies - Data similarity functions domain knowledge runtime monitor Logging Runtime Provenance recorder runtime monitor Logging Runtime Provenance recorder Python WP1 - provenance - logs - data and process versions - process dependencies (other analytics environments) ReComp is a Decision Support System Impact, cost assessment → ReComp user dashboard
  • 19. TAPP’16 P.Missier,2016 Current status and Challenges Implementation in progress Small scale experiments on scoping / partial re-run - Test cohort of about 50 (real) patients - Short workflow runs (about 15 mins), observable cost savings - (preliminary results) Main challenge: deliver a generic and reusable DSS From eScience Central → To generic dataflow, scripting (Python) From eSc prov traces → PROV-compliant but idiosyncratic patterns Python → noWorkflow traces To: Canonical PROV patterns + queries + H DB implementation ReComp: http://recomp.org.uk/
  • 20. TAPP’16 P.Missier,2016 References [1] Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger-Frank, E., Jones, M., Lee, E., Tao, J., Zhao, Y.: Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, 2005, Wiley. [2] Ikeda, Robert, Semih Salihoglu, and Jennifer Widom. Provenance-Based Refresh in Data-Oriented Workflows. In Procs CIKM, 2011 [3] R. Ikeda and J. Widom. Panda: A system for provenance and data. Procs TaPP10, 33:1–8, 2010. [4] D. Koop, E. Santos, B. Bauer, M. Troyer, J. Freire, and C. T. Silva. Bridging workflow and data provenance using strong links. In Scientific and statistical database management, pages 397–415. Springer, 2010. ISBN 3642138179. [5] P. Missier, E. Wijaya, R. Kirby, and M. Keogh. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. In Procs. 11th International conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
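
The scoping and re-run extent queries on slides 15 and 16 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the ReComp implementation: the `Run` record, the `used_deps` set (standing in for PROV `used` statements with role `'dep'`), the block names, and both function names are hypothetical. A run is in scope if any of its blocks used a changed dependency, and the re-run starts from the earliest in-scope blocks in the workflow's partial order.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One history record: a workflow execution (hypothetical schema)."""
    run_id: str
    # Workflow graph P, as block -> list of directly downstream blocks (a DAG).
    graph: dict
    # Stand-in for provenance 'used' statements with role 'dep':
    # pairs of (block name, dependency name).
    used_deps: set = field(default_factory=set)


def scope(history, changed_deps):
    """Return {run_id: P_scope}, the blocks in each run that used a changed dependency."""
    in_scope = {}
    for run in history:
        blocks = {b for (b, d) in run.used_deps if d in changed_deps}
        if blocks:
            in_scope[run.run_id] = blocks
    return in_scope


def descendants(graph, start):
    """All blocks reachable from start, excluding start itself."""
    seen, stack = set(), list(graph.get(start, []))
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(graph.get(block, []))
    return seen


def rerun_start_points(graph, p_scope):
    """Earliest blocks of P_scope: those not downstream of another in-scope block."""
    downstream = set()
    for block in p_scope:
        downstream |= descendants(graph, block)
    return p_scope - downstream


# Toy example: a linear four-block workflow shared by two patients.
graph = {"align": ["call"], "call": ["annotate"], "annotate": ["classify"], "classify": []}
history = [
    Run("patientA", graph, {("annotate", "ClinVar"), ("classify", "ClinVar"), ("annotate", "OMIM")}),
    Run("patientB", graph, {("annotate", "OMIM")}),
]

affected = scope(history, {"ClinVar"})
start = rerun_start_points(graph, affected["patientA"])
```

In this toy history only `patientA` is in scope for a ClinVar change, and its partial re-run would start at `annotate`, provided the output of the preceding `call` block was persisted as an intermediate result.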

Editor's Notes

  1. The problem of selective recomputation summarises the main problems in computational reproducibility. This seems too broad, so we need to focus on specific regions in this problem space. We do this through a running example, SVI: workflow, white box, many observables, control over provenance traces
  2. that associates a class label to each input variant depending on their estimated deleteriousness, using a simple “traffic light” notation
  3. \diffin(x_i^v, x_i^{v'})
  4. \diffd(D_i^v, D_i^{v'}) d_{ij} \in \diffd(D_i^v, D_i^{v'}) \texttt{used}(P_j, d_{ij}, [\texttt{prov:role} = \texttt{'dep'}]) \in \mathit{prov(\y^v)}
  5. \diffd(D_i^v, D_i^{v'}) d_{ij} \in \diffd(D_i^v, D_i^{v'}) \texttt{used}(P_j, \D_i, [\texttt{prov:role} = \texttt{'dep'}]), \quad d_{ij} \in D_i
  6. \diffo(y_i^v, y_i^{v'})
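
The output diff in note 6 and the sampling approach on slide 17 can be sketched together in Python. This is a hedged illustration only: `diff_out`, `estimate_impact`, the `rerun` callable (a stand-in for a partial re-run), and the use of "fraction of sampled runs whose output changed" as the impact measure are all assumptions, since the slides leave the impact function user-defined.

```python
import random


def diff_out(y_old, y_new):
    """Variants whose traffic-light class differs between two SVI-style outputs.

    Each output is a dict mapping variant -> class ('red', 'amber' or 'green').
    """
    variants = set(y_old) | set(y_new)
    return {v for v in variants if y_old.get(v) != y_new.get(v)}


def estimate_impact(in_scope, rerun, sample_size, seed=0):
    """Fraction of sampled in-scope runs whose output changes after a re-run.

    in_scope: dict run_id -> old output y
    rerun:    callable run_id -> new output y' (hypothetical partial re-run)
    """
    rng = random.Random(seed)
    run_ids = sorted(in_scope)
    sample = rng.sample(run_ids, min(sample_size, len(run_ids)))
    changed = [r for r in sample if diff_out(in_scope[r], rerun(r))]
    return len(changed) / len(sample) if sample else 0.0


# Toy population: four patients, two of whom get a different class for variant v1.
old_outputs = {
    "p1": {"v1": "amber"},
    "p2": {"v1": "amber"},
    "p3": {"v1": "green"},
    "p4": {"v1": "red"},
}
new_outputs = {
    "p1": {"v1": "red"},
    "p2": {"v1": "amber"},
    "p3": {"v1": "green"},
    "p4": {"v1": "amber"},
}

impact = estimate_impact(old_outputs, new_outputs.__getitem__, sample_size=4)
```

Here the sample covers the whole toy population, so the estimate equals the true changed fraction; with a smaller `sample_size` it would be an estimate whose cost grows with the number of sampled re-runs.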