-
1.
ReComp: preserving the value of big data insights over time
Panta Rhei (Heraclitus, through Plato)
Project Kickoff, Newcastle, 11 March 2016
(*) Painting by Johannes Moreelse
-
2.
Data to Knowledge
[Figure: Big Data flows into "The Big Analytics Machine", producing "Valuable Knowledge"; the machine rests on meta-knowledge: algorithms, tools, middleware, reference datasets.]
The Data-to-knowledge axiom of the Knowledge Economy:
What is the Total Cost of Ownership (TCO) of these knowledge assets?
-
3.
Learning from data (supervised)
[Figure: a training set plus background knowledge (the prior) feeds model learning using classification algorithms, producing a predictive classifier; the algorithms and background knowledge are meta-knowledge.]
-
4.
Learning from data (unsupervised)
[Figure: observations plus background knowledge feed model learning using clustering algorithms, producing a clustering scheme.]
-
5.
Stream Analytics
[Figure: a data stream plus background knowledge feeds time series analysis using pattern recognition algorithms, producing temporal patterns.]
-
6.
The missing element: time
[Figure: the same data-to-knowledge picture, now with a time axis: the Big Data, the meta-knowledge (algorithms, tools, middleware, reference datasets) and the "Valuable Knowledge" outputs all evolve over time, producing successive versions V1, V2, V3.]
-
7.
The ReComp decision support system
• Observe change: in big data, in meta-knowledge
• Assess and measure: knowledge decay
• Estimate: cost and benefits of a refresh
• Enact: reproduce the (analytics) processes
[Figure: the evolving data-to-knowledge picture from the previous slide, annotated with the four steps above.]
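The four steps above form a loop. As a minimal, illustrative sketch (all names, scores and thresholds below are hypothetical, not part of the ReComp design), the decision logic might be organised like this:

```python
# Minimal sketch of the observe / assess / estimate / enact loop.
# All field names, decay scores and the budget rule are hypothetical illustrations.

def recomp_loop(change_events, knowledge_assets, budget):
    """Decide which knowledge assets (KAs) to refresh after a set of changes."""
    # 1. Observe change: keep only KAs that depend on something that changed
    changed_inputs = {e["input"] for e in change_events}
    affected = [ka for ka in knowledge_assets
                if changed_inputs & set(ka["depends_on"])]

    # 2. Assess and measure knowledge decay (here: a crude per-KA score in [0, 1])
    for ka in affected:
        ka["decay"] = min(1.0, 0.2 * len(changed_inputs & set(ka["depends_on"])))

    # 3. Estimate cost and benefit of a refresh
    for ka in affected:
        ka["benefit"] = ka["decay"] * ka["value"]   # value lost if not refreshed
        ka["cost"] = ka["past_runtime_hours"]       # naive cost estimate from past runs

    # 4. Enact: refresh the most beneficial KAs that fit within the budget
    plan, spent = [], 0.0
    for ka in sorted(affected, key=lambda k: k["benefit"] / k["cost"], reverse=True):
        if spent + ka["cost"] <= budget:
            plan.append(ka["name"])
            spent += ka["cost"]
    return plan

# Toy usage
kas = [
    {"name": "KA1", "depends_on": ["ref_genome"], "value": 10, "past_runtime_hours": 3},
    {"name": "KA2", "depends_on": ["flood_model"], "value": 5, "past_runtime_hours": 1},
]
print(recomp_loop([{"input": "ref_genome"}], kas, budget=4))   # -> ['KA1']
```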
-
8.
ReComp
[Figure: the ReComp DSS loop (observe change, assess and measure, estimate, enact). Inputs: change events, diff(.,.) functions, "business rules", and previously computed KAs with their metadata. Outputs: prioritised KAs, cost estimates, and a reproducibility assessment.]
KA: Knowledge Assets
-
9.
ReComp: project objectives
Obj 1. To investigate analytics techniques aimed at supporting re-computation decisions.
Obj 2. To research techniques for assessing under what conditions it is practically feasible to re-compute an analytical process.
• Specific target system environments:
• Python / Jupyter
• The eScience Central workflow manager
Obj 3. To create a decision support system for the selective re-computation of complex data-centric analytical processes, and to demonstrate its viability on two target case studies:
• Genomics (human variant analysis)
• Urban Observatory (flood modelling)
-
10.
ReComp: Expected outcomes
Research Outcomes:
Algorithms that operate on metadata to perform:
• impact analysis
• cost estimation
• differential data and change cause analysis of past and new knowledge outcomes
• estimation of reproducibility effort
System Outcomes:
• A software framework consisting of domain-independent, reusable components, which implement the metadata infrastructure and the research outcomes
• A user-facing decision support dashboard.
It must be possible to integrate the framework with domain-specific components to support specific scenarios, exemplified by our case studies.
-
11.
ReComp: Target operating region
[Figure: two charts plotting rate of change (slow to fast) and cost (low to high) against volume (low to high), with the ReComp target region highlighted.]
-
12.
Recomputation analysis: abstraction
[Figure: knowledge assets KA1 to KA5, each computed at time t1, t2 or t3 from some combination of inputs a, b, c, d; change events a→a', b→b', c→c' affect only the assets that used the changed inputs.]
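To make the abstraction concrete, here is a small, hypothetical sketch of how change events could be mapped to affected KAs via recorded input dependencies; the dependency table is illustrative and would normally come from provenance traces (see the later metadata slides).

```python
# Hypothetical sketch: which KAs are affected by which change events?
# The dependency table (KA -> inputs used) is illustrative.
dependencies = {
    "KA1": {"a", "b", "c"},
    "KA2": {"a", "b"},
    "KA3": {"d"},
    "KA4": {"a", "b", "c", "d"},
    "KA5": {"a", "c"},
}

change_events = [("a", "a'"), ("b", "b'"), ("c", "c'")]

def affected_by(change, deps):
    """Return the KAs whose inputs include the changed item."""
    changed_input, _new_version = change
    return sorted(ka for ka, inputs in deps.items() if changed_input in inputs)

for change in change_events:
    print(change[0], "->", affected_by(change, dependencies))
# a -> ['KA1', 'KA2', 'KA4', 'KA5']
# b -> ['KA1', 'KA2', 'KA4']
# c -> ['KA1', 'KA4', 'KA5']
```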
-
13.
Recomputation analysis through sampling
[Figure: change events are monitored to identify re-computation candidates; the effects of change, the re-computation cost and the reproducibility cost are assessed, in part via a small-scale sampling re-computation, and candidates are prioritised against utility functions and a budget before the large-scale re-computation; Meta-K (metadata) feeds the estimates.]
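A minimal sketch of the sampling idea, with hypothetical names: re-compute a small random sample of the affected assets, observe the actual cost and output change, and extrapolate to the whole target population before committing to the large-scale run.

```python
import random

def estimate_by_sampling(candidates, recompute, sample_size=5, seed=0):
    """Extrapolate re-computation cost and change magnitude from a small sample.

    `candidates` is the list of re-computation candidates; `recompute(ka)` runs
    one re-computation and returns (cost, output_change) for that KA.
    Both are hypothetical stand-ins for the real ReComp machinery.
    """
    rng = random.Random(seed)
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    observed = [recompute(ka) for ka in sample]

    avg_cost = sum(c for c, _ in observed) / len(observed)
    avg_change = sum(d for _, d in observed) / len(observed)
    return {
        "estimated_total_cost": avg_cost * len(candidates),
        "estimated_avg_change": avg_change,
        "sampled": len(sample),
    }

# Toy usage: a fake recompute function standing in for a real workflow run
fake = lambda ka: (1.5, 0.1)   # (hours, fraction of outputs that changed)
print(estimate_by_sampling([f"KA{i}" for i in range(100)], fake))
```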
-
14.
Recomputation analysis through modelling
[Figure: change events identify re-computation candidates; a Change Impact Model estimates the impact of change and a Cost Model estimates the reproducibility cost/effort; prioritisation over the target population, given utility functions and a budget, drives the large-scale re-computation, whose results feed model updates back into both models.]
Change impact model: Δ(x,x') → Δ(y,y') -- challenging!
Can we do better?
• Batching, given an allocation of resources?
• Consolidating jobs with different resource requirements to optimise resource allocation?
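A sketch of what the two models might look like as interfaces, including the feedback loop from observed re-computations. The class names, the single-parameter linear forms and the update rule are hypothetical simplifications, not the project's models.

```python
# Hypothetical sketch of the Change Impact Model and Cost Model interfaces.
# delta_x quantifies an input change Δ(x, x'); the impact model predicts the
# output change Δ(y, y'); the cost model predicts re-computation cost.

class ChangeImpactModel:
    def __init__(self, sensitivity=1.0):
        self.sensitivity = sensitivity          # crude single-parameter model

    def predict(self, delta_x):
        return self.sensitivity * delta_x       # predicted Δ(y, y')

    def update(self, delta_x, observed_delta_y):
        # Move the sensitivity towards the observed ratio (model update step)
        if delta_x:
            self.sensitivity = 0.5 * self.sensitivity + 0.5 * (observed_delta_y / delta_x)

class CostModel:
    def __init__(self, hours_per_unit=1.0):
        self.hours_per_unit = hours_per_unit

    def predict(self, input_size):
        return self.hours_per_unit * input_size

    def update(self, input_size, observed_hours):
        if input_size:
            self.hours_per_unit = 0.5 * self.hours_per_unit + 0.5 * (observed_hours / input_size)

# After each large-scale re-computation, feed the observations back:
impact, cost = ChangeImpactModel(), CostModel()
impact.update(delta_x=0.2, observed_delta_y=0.05)
cost.update(input_size=30, observed_hours=45)
print(impact.predict(0.2), cost.predict(30))
```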
-
15.
Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations.
[Figure: the modelling pipeline from the previous slides, with Meta-K (logs, provenance, dependencies) feeding the Change Impact Model and the Cost Model.]
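As a tiny example of "the knowledge is in the metadata", here is a hypothetical sketch that estimates the likely cost of a future run purely from logs of past executions, one of the metadata types listed above. The log fields and the scaling rule are illustrative assumptions.

```python
# Hypothetical sketch: estimate future runtime from execution logs (Meta-K).
from statistics import mean

execution_logs = [
    {"process": "variant_calling", "input_mb": 900, "runtime_h": 5.1},
    {"process": "variant_calling", "input_mb": 1100, "runtime_h": 6.3},
    {"process": "flood_model", "input_mb": 200, "runtime_h": 0.8},
]

def expected_runtime(process, input_mb, logs):
    """Scale the mean past runtime rate by the new input size for that process."""
    past = [l for l in logs if l["process"] == process]
    if not past:
        return None
    rate = mean(l["runtime_h"] / l["input_mb"] for l in past)   # hours per MB
    return rate * input_mb

print(expected_runtime("variant_calling", 1500, execution_logs))
```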
-
16.
High level architecture
[Figure: high-level architecture. A ReComp decision dashboard (select/prioritise, execute, curate) sits on top of a Meta-Knowledge Repository of Research Objects and four analysis components: Change Impact Analysis, Cost Estimation, Differential Analysis and Reproducibility Assessment. Domain knowledge is supplied as utility functions, priority policies and data similarity functions; prospective provenance is curated with YesWorkflow. WP1 runtime monitors (logging, runtime provenance recorders) for Python and for other analytics environments feed in provenance, logs, data and process versions, and process dependencies.]
-
17.
Available technology components
• W3C PROV model for describing data dependencies (provenance)
• DataONE “metacat” for data and metadata management
• The eScience Central Workflow Management System
• Natively provenance-aware
• noWorkflow: an (experimental) Python provenance recorder
• Cloud resources:
• Azure, and our own private cloud (CIC)
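For a flavour of what a W3C PROV data-dependency record looks like, here is a minimal sketch using the `prov` Python package (assuming that package's commonly documented API; the ex: names are made up, not taken from the case studies).

```python
# Minimal PROV sketch using the `prov` package (pip install prov).
# The ex: identifiers are illustrative only.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

run = doc.activity("ex:workflow_run_42")          # one execution of the workflow
genome = doc.entity("ex:input_genome_v1")         # an input data item
variants = doc.entity("ex:variants_v1")           # the knowledge asset produced

doc.used(run, genome)                             # the run read the genome
doc.wasGeneratedBy(variants, run)                 # ...and produced the variants
doc.wasDerivedFrom(variants, genome)              # data dependency ReComp can traverse

print(doc.get_provn())                            # PROV-N serialisation
```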
[Figure: the high-level architecture diagram from the previous slide, shown again.]
-
18.
Challenge 1: estimating impact and cost
[Figure: the modelling pipeline again: the Change Impact Model and the Cost Model produce the estimates that drive prioritisation over the target population, and are updated after the large-scale re-computation.]
Change impact model: Δ(x,x') → Δ(y,y') -- challenging!
-
19.
Challenge 2: managing the metadata
How do we generate / capture / store / index / query across multiple metadata types and formats?
Relevant metadata:
• Logs of past executions, automatically collected
• Provenance traces:
• Runtime ("retrospective") provenance: an automatically collected data dependency graph, captured from the computation
• Process structure ("prospective" provenance), obtained by manually annotating a script
• External data and system dependencies, process and data versions, and system requirements
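A very small sketch of the kind of cross-type query such a repository must answer (logs plus provenance plus versions); the record shapes, tool names and figures are hypothetical, meant only to show why the metadata types need to be indexed together.

```python
# Hypothetical sketch of a metadata repository answering a cross-type query:
# "which outputs depend on tool X, and what did their runs cost on average?"
meta_k = {
    "logs": [
        {"run": "r1", "runtime_h": 5.0},
        {"run": "r2", "runtime_h": 6.0},
    ],
    "provenance": [          # retrospective: output <- run <- inputs/tools
        {"run": "r1", "output": "KA1", "uses": ["ref_genome_v1", "GATK_3.4"]},
        {"run": "r2", "output": "KA2", "uses": ["ref_genome_v1", "GATK_3.5"]},
    ],
    "versions": {"GATK": ["3.4", "3.5"]},
}

def runs_using(dependency_prefix, mk):
    """Provenance records whose inputs include the named dependency."""
    return [p for p in mk["provenance"]
            if any(u.startswith(dependency_prefix) for u in p["uses"])]

def avg_cost(runs, mk):
    """Average past runtime of the given runs, from the execution logs."""
    run_ids = {r["run"] for r in runs}
    hours = [l["runtime_h"] for l in mk["logs"] if l["run"] in run_ids]
    return sum(hours) / len(hours) if hours else None

gatk_runs = runs_using("GATK", meta_k)
print([r["output"] for r in gatk_runs], avg_cost(gatk_runs, meta_k))
# ['KA1', 'KA2'] 5.5
```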
-
20.
Challenge 3: Reproducibility
Example: a workflow to identify mutations in a patient's genome.
[Figure: the "analyse input genome" step produces variants and sits on a dependency stack: the workflow specification, the workflow manager (and its own dependencies), GATK/Picard/BWA, and a Linux (Ubuntu) VM cluster on Azure; its data dependencies include the input genome and its config, the reference genome, and the variant databases.]
What happens when any of the dependencies change?
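As a sketch of this reproducibility question: record a dependency manifest when the workflow first runs, then diff it against the current environment to see what has drifted. The entries and version numbers below are illustrative, not the real GATK/Picard/BWA pins used in the case study.

```python
# Hypothetical sketch: record the dependencies of a run, then diff against
# the current environment to see whether a faithful re-computation is possible.
recorded = {   # captured when the workflow first ran (illustrative versions)
    "GATK": "3.4", "Picard": "1.130", "BWA": "0.7.12",
    "ref_genome": "GRCh37", "variant_db": "dbSNP_141", "os": "Ubuntu 14.04",
}
current = {    # what the target environment offers today (illustrative)
    "GATK": "3.5", "Picard": "1.130", "BWA": "0.7.12",
    "ref_genome": "GRCh38", "variant_db": "dbSNP_144", "os": "Ubuntu 14.04",
}

def dependency_drift(old, new):
    """Return {dependency: (recorded, current)} for everything that changed."""
    return {k: (old[k], new.get(k)) for k in old if new.get(k) != old[k]}

print(dependency_drift(recorded, current))
# {'GATK': ('3.4', '3.5'), 'ref_genome': ('GRCh37', 'GRCh38'),
#  'variant_db': ('dbSNP_141', 'dbSNP_144')}
```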
-
21.
Challenge 4: reusability of the solution across cases
• How do we make case-specific solutions generic?
• How do we make the DSS reusable?
• Refactor: Generic framework + case-specific components
• This is hard: most elements are case-specific!
• Metadata formats
• Metadata capture
• Change impact
• Cost models
• Utility functions
• …
The times they are a-changin'
S1. Identify re-computation candidates and understand the impact of changes in Information Assets on a corpus of knowledge outcomes: which outcomes are affected by the changes, and to what extent? This step defines the target re-computation population.
S2. Estimate the effects, costs and benefits of re-computation across the target population from S1.
S3. Establish re-computation priorities within the target population, based on a budget for computational resources, problem-specific utility functions and a prioritisation policy, and the estimates from S2.
S4. Selectively carry out the priority re-computations, where the processes are reproducible.
S5. Differential data analysis and change cause analysis: assess the effects of the re-computation. This involves understanding how the new outcomes differ from the original (differential data analysis), and which of the changes in the process are responsible for the changes observed in the outcomes (change cause analysis). The latter helps data scientists understand the actual effect of an improved process post hoc, and also has the potential to improve future effect estimates. (A minimal sketch of the differential analysis follows below.)
Problem: this is “blind” and expensive. Can we do better?
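For S5, a minimal sketch of differential data analysis: compare the original and re-computed outcomes key by key and report what changed, appeared or disappeared. The outcome names and counts are made-up toy data, not case-study results.

```python
# Hypothetical sketch of S5: differential data analysis of old vs. new outcomes.
def diff_outcomes(old, new, tolerance=0.0):
    """Return per-key differences between the original and re-computed outcomes."""
    changes = {}
    for key in old.keys() | new.keys():
        o, n = old.get(key), new.get(key)
        if o is None or n is None:
            changes[key] = ("added" if o is None else "removed", o, n)
        elif abs(n - o) > tolerance:
            changes[key] = ("changed", o, n)
    return changes

original   = {"pathogenic_variants": 12, "benign_variants": 340}
recomputed = {"pathogenic_variants": 14, "benign_variants": 340, "vus": 7}
print(diff_outcomes(original, recomputed))
# {'pathogenic_variants': ('changed', 12, 14), 'vus': ('added', None, 7)}
```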
These items are partly collected automatically and partly supplied as manual annotations. They include:
• Logs of past executions, automatically collected, to be used for post hoc performance analysis and for estimating future resource requirements and thus costs (S1);
• Runtime provenance traces and prospective provenance. The former are automatically collected graphs of data dependencies, captured from the computation [11]. The latter are formal descriptions of the analytics process, obtained from the workflow specification or, more generally, by manually annotating a script. Both are instrumental to understanding how the knowledge outcomes have changed and why (S5), as well as to estimating future re-computation effects;
• External data and system dependencies, process and data versions, and system requirements associated with the analytics process, which are used to understand whether it will be practically possible to re-compute the process.