-
1.
ReComp: preserving the value of big data insights over time
Panta Rhei (Heraclitus, through Plato)
Project Kickoff, Newcastle, 11 March 2016
(*) Painting by Johannes Moreelse
-
2.
Data to Knowledge
[Figure: Big Data flows into "The Big Analytics Machine", producing "Valuable Knowledge"; the machine rests on meta-knowledge: algorithms, tools, middleware, reference datasets.]
The Data-to-knowledge axiom of the Knowledge Economy:
What is the Total Cost of Ownership (TCO) of these knowledge assets?
-
3.
Learning from data (supervised)
[Figure: a training set plus background knowledge (the prior) feeds model learning using classification algorithms, producing a predictive classifier; the algorithms and background knowledge are meta-knowledge.]
-
4.
Learning from data (unsupervised)
[Figure: observations plus background knowledge feed model learning using clustering algorithms, producing a clustering scheme.]
-
5.
Stream Analytics
[Figure: a data stream plus background knowledge feeds time series analysis using pattern recognition algorithms, producing temporal patterns.]
-
6.
The missing element: time
[Figure: the same data-to-knowledge picture, now with a time axis: the Big Data, the meta-knowledge (algorithms, tools, middleware, reference datasets) and the "Valuable Knowledge" outputs all evolve over time, producing successive versions V1, V2, V3.]
-
7.
The ReComp decision support system
• Observe change: in big data, in meta-knowledge
• Assess and measure: knowledge decay
• Estimate: cost and benefits of a refresh
• Enact: reproduce the (analytics) processes
[Figure: the evolving data-to-knowledge picture from the previous slide, annotated with the four steps above.]
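The four steps above form a loop. As a minimal, illustrative sketch (all names, scores and thresholds below are hypothetical, not part of the ReComp design), the decision logic might be organised like this:

```python
# Minimal sketch of the observe / assess / estimate / enact loop.
# All field names, decay scores and the budget rule are hypothetical illustrations.

def recomp_loop(change_events, knowledge_assets, budget):
    """Decide which knowledge assets (KAs) to refresh after a set of changes."""
    # 1. Observe change: keep only KAs that depend on something that changed
    changed_inputs = {e["input"] for e in change_events}
    affected = [ka for ka in knowledge_assets
                if changed_inputs & set(ka["depends_on"])]

    # 2. Assess and measure knowledge decay (here: a crude per-KA score in [0, 1])
    for ka in affected:
        ka["decay"] = min(1.0, 0.2 * len(changed_inputs & set(ka["depends_on"])))

    # 3. Estimate cost and benefit of a refresh
    for ka in affected:
        ka["benefit"] = ka["decay"] * ka["value"]   # value lost if not refreshed
        ka["cost"] = ka["past_runtime_hours"]       # naive cost estimate from past runs

    # 4. Enact: refresh the most beneficial KAs that fit within the budget
    plan, spent = [], 0.0
    for ka in sorted(affected, key=lambda k: k["benefit"] / k["cost"], reverse=True):
        if spent + ka["cost"] <= budget:
            plan.append(ka["name"])
            spent += ka["cost"]
    return plan

# Toy usage
kas = [
    {"name": "KA1", "depends_on": ["ref_genome"], "value": 10, "past_runtime_hours": 3},
    {"name": "KA2", "depends_on": ["flood_model"], "value": 5, "past_runtime_hours": 1},
]
print(recomp_loop([{"input": "ref_genome"}], kas, budget=4))   # -> ['KA1']
```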
-
8.
ReComp
[Figure: the ReComp DSS loop (observe change, assess and measure, estimate, enact). Inputs: change events, diff(.,.) functions, "business rules", and previously computed KAs with their metadata. Outputs: prioritised KAs, cost estimates, and a reproducibility assessment.]
KA: Knowledge Assets
-
9.
ReComp: project objectives
Obj 1. To investigate analytics techniques aimed at supporting re-computation decisions.
Obj 2. To research techniques for assessing under what conditions it is practically feasible to re-compute an analytical process.
• Specific target system environments:
• Python / Jupyter
• The eScience Central workflow manager
Obj 3. To create a decision support system for the selective re-computation of complex data-centric analytical processes, and to demonstrate its viability on two target case studies:
• Genomics (human variant analysis)
• Urban Observatory (flood modelling)
-
10.
ReComp: Expected outcomes
Research Outcomes:
Algorithms that operate on metadata to perform:
• impact analysis
• cost estimation
• differential data and change cause analysis of past and new knowledge outcomes
• estimation of reproducibility effort
System Outcomes:
• A software framework consisting of domain-independent, reusable components, which implement the metadata infrastructure and the research outcomes
• A user-facing decision support dashboard.
It must be possible to integrate the framework with domain-specific components to support specific scenarios, exemplified by our case studies.
-
11.
ReComp: Target operating region
[Figure: two charts plotting rate of change (slow to fast) and cost (low to high) against volume (low to high), with the ReComp target region highlighted.]
-
12.
Recomputation analysis: abstraction
[Figure: knowledge assets KA1 to KA5, each computed at time t1, t2 or t3 from some combination of inputs a, b, c, d; change events a→a', b→b', c→c' affect only the assets that used the changed inputs.]
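To make the abstraction concrete, here is a small, hypothetical sketch of how change events could be mapped to affected KAs via recorded input dependencies; the dependency table is illustrative and would normally come from provenance traces (see the later metadata slides).

```python
# Hypothetical sketch: which KAs are affected by which change events?
# The dependency table (KA -> inputs used) is illustrative.
dependencies = {
    "KA1": {"a", "b", "c"},
    "KA2": {"a", "b"},
    "KA3": {"d"},
    "KA4": {"a", "b", "c", "d"},
    "KA5": {"a", "c"},
}

change_events = [("a", "a'"), ("b", "b'"), ("c", "c'")]

def affected_by(change, deps):
    """Return the KAs whose inputs include the changed item."""
    changed_input, _new_version = change
    return sorted(ka for ka, inputs in deps.items() if changed_input in inputs)

for change in change_events:
    print(change[0], "->", affected_by(change, dependencies))
# a -> ['KA1', 'KA2', 'KA4', 'KA5']
# b -> ['KA1', 'KA2', 'KA4']
# c -> ['KA1', 'KA4', 'KA5']
```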
-
13.
Recomputation analysis through sampling
[Figure: change events are monitored to identify re-computation candidates; the effects of change, the re-computation cost and the reproducibility cost are assessed, in part via a small-scale sampling re-computation, and candidates are prioritised against utility functions and a budget before the large-scale re-computation; Meta-K (metadata) feeds the estimates.]
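A minimal sketch of the sampling idea, with hypothetical names: re-compute a small random sample of the affected assets, observe the actual cost and output change, and extrapolate to the whole target population before committing to the large-scale run.

```python
import random

def estimate_by_sampling(candidates, recompute, sample_size=5, seed=0):
    """Extrapolate re-computation cost and change magnitude from a small sample.

    `candidates` is the list of re-computation candidates; `recompute(ka)` runs
    one re-computation and returns (cost, output_change) for that KA.
    Both are hypothetical stand-ins for the real ReComp machinery.
    """
    rng = random.Random(seed)
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    observed = [recompute(ka) for ka in sample]

    avg_cost = sum(c for c, _ in observed) / len(observed)
    avg_change = sum(d for _, d in observed) / len(observed)
    return {
        "estimated_total_cost": avg_cost * len(candidates),
        "estimated_avg_change": avg_change,
        "sampled": len(sample),
    }

# Toy usage: a fake recompute function standing in for a real workflow run
fake = lambda ka: (1.5, 0.1)   # (hours, fraction of outputs that changed)
print(estimate_by_sampling([f"KA{i}" for i in range(100)], fake))
```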
-
14.
Recomputation analysis through modelling
[Figure: change events identify re-computation candidates; a Change Impact Model estimates the impact of change and a Cost Model estimates the reproducibility cost/effort; prioritisation over the target population, given utility functions and a budget, drives the large-scale re-computation, whose results feed model updates back into both models.]
Change impact model: Δ(x,x') → Δ(y,y') -- challenging!
Can we do better?
• Batching, given an allocation of resources?
• Consolidating jobs with different resource requirements to optimise resource allocation?
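A sketch of what the two models might look like as interfaces, including the feedback loop from observed re-computations. The class names, the single-parameter linear forms and the update rule are hypothetical simplifications, not the project's models.

```python
# Hypothetical sketch of the Change Impact Model and Cost Model interfaces.
# delta_x quantifies an input change Δ(x, x'); the impact model predicts the
# output change Δ(y, y'); the cost model predicts re-computation cost.

class ChangeImpactModel:
    def __init__(self, sensitivity=1.0):
        self.sensitivity = sensitivity          # crude single-parameter model

    def predict(self, delta_x):
        return self.sensitivity * delta_x       # predicted Δ(y, y')

    def update(self, delta_x, observed_delta_y):
        # Move the sensitivity towards the observed ratio (model update step)
        if delta_x:
            self.sensitivity = 0.5 * self.sensitivity + 0.5 * (observed_delta_y / delta_x)

class CostModel:
    def __init__(self, hours_per_unit=1.0):
        self.hours_per_unit = hours_per_unit

    def predict(self, input_size):
        return self.hours_per_unit * input_size

    def update(self, input_size, observed_hours):
        if input_size:
            self.hours_per_unit = 0.5 * self.hours_per_unit + 0.5 * (observed_hours / input_size)

# After each large-scale re-computation, feed the observations back:
impact, cost = ChangeImpactModel(), CostModel()
impact.update(delta_x=0.2, observed_delta_y=0.05)
cost.update(input_size=30, observed_hours=45)
print(impact.predict(0.2), cost.predict(30))
```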
-
15.
Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations.
[Figure: the modelling pipeline from the previous slides, with Meta-K (logs, provenance, dependencies) feeding the Change Impact Model and the Cost Model.]
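As a tiny example of "the knowledge is in the metadata", here is a hypothetical sketch that estimates the likely cost of a future run purely from logs of past executions, one of the metadata types listed above. The log fields and the scaling rule are illustrative assumptions.

```python
# Hypothetical sketch: estimate future runtime from execution logs (Meta-K).
from statistics import mean

execution_logs = [
    {"process": "variant_calling", "input_mb": 900, "runtime_h": 5.1},
    {"process": "variant_calling", "input_mb": 1100, "runtime_h": 6.3},
    {"process": "flood_model", "input_mb": 200, "runtime_h": 0.8},
]

def expected_runtime(process, input_mb, logs):
    """Scale the mean past runtime rate by the new input size for that process."""
    past = [l for l in logs if l["process"] == process]
    if not past:
        return None
    rate = mean(l["runtime_h"] / l["input_mb"] for l in past)   # hours per MB
    return rate * input_mb

print(expected_runtime("variant_calling", 1500, execution_logs))
```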
-
16.
High level architecture
[Figure: high-level architecture. A ReComp decision dashboard (select/prioritise, execute, curate) sits on top of a Meta-Knowledge Repository of Research Objects and four analysis components: Change Impact Analysis, Cost Estimation, Differential Analysis and Reproducibility Assessment. Domain knowledge is supplied as utility functions, priority policies and data similarity functions; prospective provenance is curated with YesWorkflow. WP1 runtime monitors (logging, runtime provenance recorders) for Python and for other analytics environments feed in provenance, logs, data and process versions, and process dependencies.]
-
17.
Available technology components
• W3C PROV model for describing data dependencies (provenance)
• DataONE “metacat” for data and metadata management
• The eScience Central Workflow Management System
• Natively provenance-aware
• noWorkflow: an (experimental) Python provenance recorder
• Cloud resources:
• Azure, and our own private cloud (CIC)
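For a flavour of what a W3C PROV data-dependency record looks like, here is a minimal sketch using the `prov` Python package (assuming that package's commonly documented API; the ex: names are made up, not taken from the case studies).

```python
# Minimal PROV sketch using the `prov` package (pip install prov).
# The ex: identifiers are illustrative only.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

run = doc.activity("ex:workflow_run_42")          # one execution of the workflow
genome = doc.entity("ex:input_genome_v1")         # an input data item
variants = doc.entity("ex:variants_v1")           # the knowledge asset produced

doc.used(run, genome)                             # the run read the genome
doc.wasGeneratedBy(variants, run)                 # ...and produced the variants
doc.wasDerivedFrom(variants, genome)              # data dependency ReComp can traverse

print(doc.get_provn())                            # PROV-N serialisation
```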
[Figure: the high-level architecture diagram from the previous slide, shown again.]
-
18.
Challenge 1: estimating impact and cost
[Figure: the modelling pipeline again: the Change Impact Model and the Cost Model produce the estimates that drive prioritisation over the target population, and are updated after the large-scale re-computation.]
Change impact model: Δ(x,x') → Δ(y,y') -- challenging!
-
19.
Challenge 2: managing the metadata
How do we generate / capture / store / index / query across multiple metadata types and formats?
Relevant metadata:
• Logs of past executions, automatically collected
• Provenance traces:
• Runtime ("retrospective") provenance: an automatically collected data dependency graph, captured from the computation
• Process structure ("prospective" provenance), obtained by manually annotating a script
• External data and system dependencies, process and data versions, and system requirements
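A very small sketch of the kind of cross-type query such a repository must answer (logs plus provenance plus versions); the record shapes, tool names and figures are hypothetical, meant only to show why the metadata types need to be indexed together.

```python
# Hypothetical sketch of a metadata repository answering a cross-type query:
# "which outputs depend on tool X, and what did their runs cost on average?"
meta_k = {
    "logs": [
        {"run": "r1", "runtime_h": 5.0},
        {"run": "r2", "runtime_h": 6.0},
    ],
    "provenance": [          # retrospective: output <- run <- inputs/tools
        {"run": "r1", "output": "KA1", "uses": ["ref_genome_v1", "GATK_3.4"]},
        {"run": "r2", "output": "KA2", "uses": ["ref_genome_v1", "GATK_3.5"]},
    ],
    "versions": {"GATK": ["3.4", "3.5"]},
}

def runs_using(dependency_prefix, mk):
    """Provenance records whose inputs include the named dependency."""
    return [p for p in mk["provenance"]
            if any(u.startswith(dependency_prefix) for u in p["uses"])]

def avg_cost(runs, mk):
    """Average past runtime of the given runs, from the execution logs."""
    run_ids = {r["run"] for r in runs}
    hours = [l["runtime_h"] for l in mk["logs"] if l["run"] in run_ids]
    return sum(hours) / len(hours) if hours else None

gatk_runs = runs_using("GATK", meta_k)
print([r["output"] for r in gatk_runs], avg_cost(gatk_runs, meta_k))
# ['KA1', 'KA2'] 5.5
```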
-
20.
Challenge 3: Reproducibility
Example: a workflow to identify mutations in a patient's genome.
[Figure: the "analyse input genome" step produces variants and sits on a dependency stack: the workflow specification, the workflow manager (and its own dependencies), GATK/Picard/BWA, and a Linux (Ubuntu) VM cluster on Azure; its data dependencies include the input genome and its config, the reference genome, and the variant databases.]
What happens when any of the dependencies change?
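As a sketch of this reproducibility question: record a dependency manifest when the workflow first runs, then diff it against the current environment to see what has drifted. The entries and version numbers below are illustrative, not the real GATK/Picard/BWA pins used in the case study.

```python
# Hypothetical sketch: record the dependencies of a run, then diff against
# the current environment to see whether a faithful re-computation is possible.
recorded = {   # captured when the workflow first ran (illustrative versions)
    "GATK": "3.4", "Picard": "1.130", "BWA": "0.7.12",
    "ref_genome": "GRCh37", "variant_db": "dbSNP_141", "os": "Ubuntu 14.04",
}
current = {    # what the target environment offers today (illustrative)
    "GATK": "3.5", "Picard": "1.130", "BWA": "0.7.12",
    "ref_genome": "GRCh38", "variant_db": "dbSNP_144", "os": "Ubuntu 14.04",
}

def dependency_drift(old, new):
    """Return {dependency: (recorded, current)} for everything that changed."""
    return {k: (old[k], new.get(k)) for k in old if new.get(k) != old[k]}

print(dependency_drift(recorded, current))
# {'GATK': ('3.4', '3.5'), 'ref_genome': ('GRCh37', 'GRCh38'),
#  'variant_db': ('dbSNP_141', 'dbSNP_144')}
```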
-
21.
Challenge 4: reusability of the solution across cases
• How do we make case-specific solutions generic?
• How do we make the DSS reusable?
• Refactor: Generic framework + case-specific components
• This is hard: most elements are case-specific!
• Metadata formats
• Metadata capture
• Change impact
• Cost models
• Utility functions
• …
The times they are a-changin'
S1. Identify re-computation candidates and understand the impact of changes in Information Assets on a corpus of knowledge outcomes: which outcomes are affected by the changes, and to what extent? This step defines the target re-computation population.
S2. Estimate the effects, costs and benefits of re-computation across the target population from S1.
S3. Establish re-computation priorities within the target population, based on a budget for computational resources, problem-specific utility functions and a prioritisation policy, and the estimates from S2.
S4. Selectively carry out the priority re-computations, where the processes are reproducible.
S5. Differential data analysis and change cause analysis: assess the effects of the re-computation. This involves understanding how the new outcomes differ from the original (differential data analysis), and which of the changes in the process are responsible for the changes observed in the outcomes (change cause analysis). The latter helps data scientists understand the actual effect of an improved process post hoc, and also has the potential to improve future effect estimates. (A minimal sketch of the differential analysis follows below.)
Problem: this is “blind” and expensive. Can we do better?
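For S5, a minimal sketch of differential data analysis: compare the original and re-computed outcomes key by key and report what changed, appeared or disappeared. The outcome names and counts are made-up toy data, not case-study results.

```python
# Hypothetical sketch of S5: differential data analysis of old vs. new outcomes.
def diff_outcomes(old, new, tolerance=0.0):
    """Return per-key differences between the original and re-computed outcomes."""
    changes = {}
    for key in old.keys() | new.keys():
        o, n = old.get(key), new.get(key)
        if o is None or n is None:
            changes[key] = ("added" if o is None else "removed", o, n)
        elif abs(n - o) > tolerance:
            changes[key] = ("changed", o, n)
    return changes

original   = {"pathogenic_variants": 12, "benign_variants": 340}
recomputed = {"pathogenic_variants": 14, "benign_variants": 340, "vus": 7}
print(diff_outcomes(original, recomputed))
# {'pathogenic_variants': ('changed', 12, 14), 'vus': ('added', None, 7)}
```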
These items are partly collected automatically and partly supplied as manual annotations. They include:
• Logs of past executions, automatically collected, to be used for post hoc performance analysis and for estimating future resource requirements and thus costs (S1);
• Runtime provenance traces and prospective provenance. The former are automatically collected graphs of data dependencies, captured from the computation [11]. The latter are formal descriptions of the analytics process, obtained from the workflow specification or, more generally, by manually annotating a script. Both are instrumental to understanding how the knowledge outcomes have changed and why (S5), as well as to estimating future re-computation effects;
• External data and system dependencies, process and data versions, and system requirements associated with the analytics process, which are used to understand whether it will be practically possible to re-compute the process.