Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership" of knowledge derived from data (TCO-DK) includes not only the initial analysis but also the cost of refreshing the knowledge over time, yet it is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and to assess the costs and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by presenting a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios, where we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
Your data won’t stay smart forever: exploring the temporal dimension of (big data) analytics
1. ReComp@Scalable
Newcastle, May 9, 2016
Your data won’t stay smart forever:
exploring the temporal dimension of (big data) analytics
Paolo Missier, Jacek Cala, Manisha Rathi
Scalable Computing Group Seminar
Newcastle, May 2016
Panta Rhei (Heraclitus)
(*) Painting by Johannes Moreelse
The ReComp decision support system
Observe change:
• in big data
• in meta-knowledge
Assess and measure:
• knowledge decay
Estimate:
• cost and benefits of refresh
Enact:
• reproduce (analytics) processes
[Figure: Big Data flows into "The Big Analytics Machine" to produce "Valuable Knowledge" in successive versions V1, V2, V3 over time t; the machine draws on meta-knowledge: algorithms, tools, middleware, reference datasets]
ReComp scenarios
For each scenario: target impact areas; why ReComp is relevant; proof-of-concept experiments; expected optimisation.

1. Dataflow, experimental science
• Target impact areas: genomics
• Why ReComp is relevant: rapid knowledge advances; rapid scaling up of genetic testing at population level
• Proof of concept: WES/SVI pipeline, workflow implementation (eScience Central)
• Expected optimisation: timeliness and accuracy of patient diagnosis subject to budget constraints

2. Time series analysis
• Target impact areas: personal health monitoring; smart city analytics
• Why ReComp is relevant: IoT data streams; rapid data drift; cost of computation at the network edge (e.g. IoT)
• Proof of concept: NYC taxi rides challenge (DEBS’15)
• Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low

3. Data layer optimisation
• Target impact areas: tuning of a large-scale data management stack
• Why ReComp is relevant: optimal data organisation is sensitive to current data profiles
• Proof of concept: graph DB re-partitioning
• Expected optimisation: system throughput vs cost of re-tuning

4. Model learning
• Target impact areas: applications of predictive analytics
• Why ReComp is relevant: predictive models are very sensitive to data drift
• Proof of concept: Twitter content analysis
• Expected optimisation: sustained model predictive power over time vs retraining cost

5. Simulation
• Target impact areas: TBD
• Why ReComp is relevant: repeated simulation is computationally expensive but often not beneficial
• Proof of concept: flood modelling / CityCat Newcastle
• Expected optimisation: computational resources vs marginal benefit of a new simulation model
Data-intensive systems: two properties
1. Observability (transparency)
How much of a data-intensive system can we observe?
• structure + data flow
2. Control
How much control do we have over the system?
• Execution frequency, total, partial
• Input density
Observability / transparency
White box:
• Structure (static view): dataflow systems (eScience Central, Taverna, VisTrails…); scripting (R, Matlab, Python...)
• Data dependencies (runtime view): provenance recording of inputs, reference datasets, component versions, outputs
• Cost: detailed resource monitoring; cloud £££

Black box:
• Structure (static view): function semantics; packaged components; third-party services
• Data dependencies (runtime view): inputs and outputs only; no data dependencies; no details on individual components
• Cost: wall clock time; service pricing; setup time (e.g. model learning)
White box ReComp

[Figure: process P with inputs x11, x12, dependencies D11, D12, and output y11]

For each run i:
Observables:
• Inputs X = {xi1, xi2, …}
• Outputs Y = {yi1, yi2, …}
• Dependencies Di1, Di2, ...
• Provenance prov(y)
• Cost(y)
• Process structure P = P1.P2…Pk
Measurable changes:
• Input diff: δ(xt, xt+1)
• Output diff: δ(yt, yt+1)
• Dependency diff: δ(Dt, Dt+1)
Control:
• Complete / partial rerun
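The version diffs above can be made concrete with a small sketch. This is not part of any ReComp tool: the `diff` helper and the toy input dictionaries are illustrative assumptions, modelling each version as a key/value mapping.

```python
# Minimal sketch of the "measurable changes" idea: delta(v_t, v_t+1) as the
# keys added, removed, or changed between two versions of a mapping.
# All names here (diff, x_t, x_t1) are illustrative, not a ReComp API.

def diff(old: dict, new: dict) -> dict:
    """Compute delta(old, new): added, removed, and changed entries."""
    return {
        "added":   {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "changed": {k: (old[k], new[k])
                    for k in old.keys() & new.keys() if old[k] != new[k]},
    }

# Inputs X at two times t and t+1 (toy values)
x_t  = {"x11": "sample-A.vcf", "x12": "panel-v1"}
x_t1 = {"x11": "sample-A.vcf", "x12": "panel-v2"}

d = diff(x_t, x_t1)   # input diff delta(x_t, x_t+1)
print(d["changed"])   # {'x12': ('panel-v1', 'panel-v2')}
```

The same helper applies unchanged to dependency diffs δ(Dt, Dt+1) when a dependency is recorded as a mapping from record identifier to version.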
A history of runs
[Figure: two runs of process P; Run 1 (Patient A) with inputs x11, x12, dependencies D11, DCV, output y1; Run 2 (Patient B) with inputs x21, x22, dependencies D21, DCV, output y2]
H = {<x,D,P,y, prov(y), cost(y)>}
Note on dependencies Dij:
Fine-grained: Dij are the results of a query to a dependent data source (OMIM)
Coarse-grained: only record that (a specific version of) OMIM has been used
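The history tuple <x, D, P, y, prov(y), cost(y)> can be sketched as a plain record type. The `Run` class and its field values are hypothetical, chosen only to mirror the slide's notation; the coarse-grained option records just the dependency version, as noted above.

```python
from dataclasses import dataclass

# Illustrative record type for one entry of the run history
# H = {<x, D, P, y, prov(y), cost(y)>}. Names are assumptions, not ReComp API.

@dataclass
class Run:
    x: dict       # inputs, e.g. {"x11": "patientA.vcf"}
    D: dict       # coarse-grained dependency versions, e.g. {"OMIM": "2016-04"}
    P: str        # process identifier / version
    y: dict       # outputs
    prov: set     # provenance: dependency records y was derived from
    cost: float   # e.g. CPU-hours or cloud cost

H = []  # the history H, one Run per execution
H.append(Run(x={"x11": "patientA.vcf"}, D={"OMIM": "2016-04"},
             P="SVI-v1", y={"y1": "benign"},
             prov={("OMIM", "2016-04")}, cost=3.5))
print(len(H))  # 1
```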
ReComp questions
• Scoping:
Which patients (from a large cohort) are going to be affected by
change in input/reference data?
• Impact:
For each patient in scope, how likely is each patient’s diagnosis to
change?
Approach:
Given Dt+1 and changes δ(Dt, Dt+1):
For each patient X and outcome y, query prov(y) to find references
to Dt.
Patient X is in scope if prov(y) ∩ δ(Dt, Dt+1) is not empty
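The scoping rule stated above is a set intersection, and can be sketched directly. The representation of provenance entries and of the change set as sets of record identifiers (here, invented OMIM-style ids) is an assumption for illustration.

```python
# Sketch of the scoping rule: patient X is in scope iff
# prov(y) intersect delta(D_t, D_t+1) is non-empty.
# Record identifiers below are illustrative, not real OMIM content.

def in_scope(prov_y: set, delta_D: set) -> bool:
    """True if outcome y depends on at least one changed record."""
    return bool(prov_y & delta_D)

delta_D = {"OMIM:604370", "OMIM:113705"}   # records changed in D_t+1

patients = {
    "A": {"OMIM:604370", "OMIM:190070"},   # prov(y_A)
    "B": {"OMIM:602421"},                  # prov(y_B)
}
affected = [p for p, prov in patients.items() if in_scope(prov, delta_D)]
print(affected)  # ['A']
```

Only patient A is in scope, so only A's diagnosis is a candidate for refresh; B's run is skipped entirely.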
Missier, Paolo, Jacek Cala, and Eldarina Wijaya. “The Data, They Are a-Changin’.” In
Proc. TAPP’16 (Theory and Practice of Provenance), edited by Sarah Cohen-Boulakia.
Washington D.C., USA: USENIX Association, 2016. https://arxiv.org/abs/1604.06412.
Example: flood modelling in Newcastle

CityCAT (City Catchment Analysis Tool)
Inputs:
• Topography (DEMs from LIDAR)
• Physical structures (buildings etc.)
• Land use data
Outputs:
• High-resolution grid of flood depths

Observables:
• Inputs X = {xi1, xi2, …}
• Outputs Y = {yi1, yi2, …}
• Cost(y)
Measurable changes:
• Input diff: δ(xt, xt+1)
Control:
• Simulation rerun
• Grid resolution
• Regional boundaries
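One way to read this scenario is as a rerun-policy decision: re-run the simulation only when the observed input change justifies the compute cost. The following sketch is hypothetical; the change metric, benefit estimate, and thresholds are all assumptions, not CityCAT functionality.

```python
# Hypothetical rerun policy for an expensive simulation: re-run only when
# the estimated benefit of refreshing the output outweighs the cost.
# All parameters and units are illustrative assumptions.

def should_rerun(input_change: float, expected_benefit: float,
                 rerun_cost: float) -> bool:
    """Re-run iff inputs actually changed and benefit exceeds cost."""
    return input_change > 0 and expected_benefit > rerun_cost

# e.g. a new LIDAR DEM changed 12% of grid cells; refreshing flood depths
# is estimated to be worth 8 cost units against a 5-unit compute cost.
print(should_rerun(input_change=0.12, expected_benefit=8.0, rerun_cost=5.0))  # True
```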
Example: model learning pattern
Black box ReComp:
Observables:
• Outputs Y = {yi1, yi2, …}
• Cost(y) (retraining)
Measurable changes:
• Output quality relative to ground truth: qty(yt)
Control:
• Request to retrain
Black box ReComp – I
• When does the model require retraining?
• What is the expected cost and benefit of re-training the model at
any given time?
The Velox system / Berkeley AMP Lab:
Crankshaw, Daniel, Peter Bailis, Joseph E Gonzalez, Haoyuan Li, Zhao Zhang, Michael J Franklin, Ali
Ghodsi, and Michael I Jordan. “The Missing Piece in Complex Analytics: Low Latency, Scalable
Model Management and Serving with Velox.” In Procs CIDR 2015, Seventh Biennial Conference on
Innovative Data Systems Research, Asilomar, CA, USA.
http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper19u.pdf
Example: time series pattern
Black box ReComp: top-k most frequent NYC taxi routes over time

Observables:
• Outputs Y = {yi1, yi2, …}
• Cost(y)
Measurable changes:
• Output diff: δ(yt, yt+1)
Control:
• Sampling frequency / sampling density
Control question, under input data drift:
• How often and how densely should we sample from the stream to keep the output sufficiently current?
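A simple answer to that control question is an adaptive controller: back off the sampling rate while the top-k output is stable, sample more densely when it drifts. The sketch below is an assumption throughout; the drift metric, the halve/double policy, and the thresholds are invented for illustration, not the DEBS'15 solution.

```python
# Hypothetical adaptive sampling for the top-k stream pattern:
# small output diff delta(y_t, y_t+1) -> sample less often;
# large diff -> sample more densely. All parameters are toy values.

def topk_diff(prev: list, curr: list) -> float:
    """Fraction of positions on which two top-k lists disagree."""
    return sum(a != b for a, b in zip(prev, curr)) / max(len(curr), 1)

def next_sampling_rate(rate: float, drift: float,
                       lo: float = 0.05, hi: float = 1.0) -> float:
    """Halve the rate under low drift; double it under high drift."""
    if drift < 0.1:
        return max(lo, rate / 2)
    if drift > 0.3:
        return min(hi, rate * 2)
    return rate

prev_topk = ["JFK->Midtown", "SoHo->LES", "Chelsea->Harlem"]
curr_topk = ["JFK->Midtown", "SoHo->LES", "Chelsea->Harlem"]
drift = topk_diff(prev_topk, curr_topk)   # 0.0: the output is stable
print(next_sampling_rate(0.4, drift))     # 0.2: sample half as often
```

This is also where low-power edge devices come in: a stable output lets the edge node skip most of the stream at low risk.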
Example: graph repartitioning
• The Taper partitioner optimises for a given query workload
• The performance of a partitioning is well defined: # of inter-partition traversals (#ipt)
• Performance degrades when the query workload changes

[Plot: "Inter-partition traversals vs. % workload change"; traversals (0 to ~3M) grow as workload change increases from 0% to 120%]

Observables:
• Outputs Y = {yi1, yi2, …}
• Cost(y) (re-partition)
Measurable changes:
• Output quality: #ipt
Control:
• Re-partition requests
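The "system throughput vs cost of re-tuning" trade-off can be framed as a trigger: re-partition when the extra traversals expected before the next tuning opportunity outweigh the one-off re-partitioning cost. The function below is a sketch under assumptions; cost units, the horizon, and all figures are illustrative, not Taper's policy.

```python
# Hypothetical re-partitioning trigger: compare the projected cost of extra
# inter-partition traversals (#ipt) over a query horizon against the one-off
# cost of re-running the partitioner. All units and numbers are toy values.

def should_repartition(ipt_now: int, ipt_baseline: int,
                       traversal_cost: float, repartition_cost: float,
                       horizon_queries: int) -> bool:
    """Re-partition iff projected savings over the horizon exceed the cost."""
    extra_per_query = max(ipt_now - ipt_baseline, 0)
    projected_saving = extra_per_query * traversal_cost * horizon_queries
    return projected_saving > repartition_cost

# After a large workload change, #ipt per query rose from 0.5M to 2M.
print(should_repartition(ipt_now=2_000_000, ipt_baseline=500_000,
                         traversal_cost=1e-6, repartition_cost=1000.0,
                         horizon_queries=5000))  # True
```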
A summary of ReComp problems
Observables:
• Inputs X = {xi1, xi2, …}
• Outputs Y = {yi1, yi2, …}
• Dependencies D11, D12, ...
• Provenance prov(y)
• Cost(y)
• Process structure P = P1.P2…Pk
Measurable changes:
• Input diff: δ(xt, xt+1)
• Output diff: δ(yt, yt+1)
• Dependency diff: δ(Dt, Dt+1)
• Quality(y)
Control:
• Data selection
• Partial / complete rerun

Problems:
• Forwards (requires white box): scoping, i.e. identifying the affected population subset; change impact analysis over inputs, dependencies, and outputs. Sample problems: dataflow analytics, simulation, many-runs problems.
• Backwards (black box suffices): reacting to output instability / input drift. Sample problems: model learning, time series analytics, data-driven optimisation.
The next steps -- challenges
1. Optimisation:
Observables + control → a reactive system
+ cost and utility functions → optimisation problems
2. Learning from history:
Can we use history to learn estimates of impact without the need for
actual re-computation?
3. Software infrastructure and tooling:
ReComp is a metadata management and analytics exercise
4. Reproducibility:
What really happens when I press the “ReComp” button?
5. Impact:
How do we address the key impact areas?
• e-health
• genomics
• smart city management