Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership" of knowledge derived from data (TCO-DK) includes not only the initial analysis but also the cost of refreshing the knowledge over time, yet it is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and to assess the costs and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by presenting a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios, where we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
Your data won’t stay smart forever: exploring the temporal dimension of (big data) analytics
1. ReComp@Scalable
Newcastle, May 9, 2016
Your data won’t stay smart forever:
exploring the temporal dimension of (big data) analytics
Paolo Missier, Jacek Cala, Manisha Rathi
Scalable Computing Group Seminar
Newcastle, May 2016
Panta Rhei (Heraclitus)
(*) Painting by Johannes Moreelse
The ReComp decision support system
Observe change:
• in big data
• in meta-knowledge
Assess and measure:
• knowledge decay
Estimate:
• cost and benefits of refresh
Enact:
• reproduce (analytics) processes
[Figure: Big Data flows into "The Big Analytics Machine" to produce "Valuable Knowledge" in successive versions V1, V2, V3 over time t; the machine draws on meta-knowledge: algorithms, tools, middleware, reference datasets]
ReComp scenarios
For each scenario: target impact areas; why ReComp is relevant; proof-of-concept experiments; expected optimisation.

1. Dataflow, experimental science
• Target impact areas: genomics
• Why ReComp is relevant: rapid knowledge advances; rapid scaling up of genetic testing at population level
• Proof of concept: WES/SVI pipeline, workflow implementation (eScience Central)
• Expected optimisation: timeliness and accuracy of patient diagnosis subject to budget constraints

2. Time series analysis
• Target impact areas: personal health monitoring; smart city analytics
• Why ReComp is relevant: IoT data streams; rapid data drift; cost of computation at the network edge (e.g. IoT)
• Proof of concept: NYC taxi rides challenge (DEBS’15)
• Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low

3. Data layer optimisation
• Target impact areas: tuning of a large-scale data management stack
• Why ReComp is relevant: optimal data organisation is sensitive to current data profiles
• Proof of concept: graph DB re-partitioning
• Expected optimisation: system throughput vs cost of re-tuning

4. Model learning
• Target impact areas: applications of predictive analytics
• Why ReComp is relevant: predictive models are very sensitive to data drift
• Proof of concept: Twitter content analysis
• Expected optimisation: sustained model predictive power over time vs retraining cost

5. Simulation
• Target impact areas: TBD
• Why ReComp is relevant: repeated simulation is computationally expensive but often not beneficial
• Proof of concept: flood modelling / CityCat Newcastle
• Expected optimisation: computational resources vs marginal benefit of a new simulation model
Data-intensive systems: two properties
1. Observability (transparency)
How much of a data-intensive system can we observe?
• structure + data flow
2. Control
How much control do we have over the system?
• Execution frequency, total, partial
• Input density
Observability / transparency
White box:
• Structure (static view): dataflow systems (eScience Central, Taverna, VisTrails…); scripting (R, Matlab, Python...)
• Data dependencies (runtime view): provenance recording of inputs, reference datasets, component versions, outputs
• Cost: detailed resource monitoring; cloud £££

Black box:
• Structure (static view): function semantics; packaged components; third-party services
• Data dependencies (runtime view): inputs and outputs only; no data dependencies; no details on individual components
• Cost: wall clock time; service pricing; setup time (e.g. model learning)
White box ReComp

[Figure: process P with inputs x11, x12, dependencies D11, D12, and output y11]

For each run i:
Observables:
• Inputs X = {xi1, xi2, …}
• Outputs Y = {yi1, yi2, …}
• Dependencies Di1, Di2, ...
• Provenance prov(y)
• Cost(y)
• Process structure P = P1.P2…Pk
Measurable changes:
• Input diff: δ(xt, xt+1)
• Output diff: δ(yt, yt+1)
• Dependency diff: δ(Dt, Dt+1)
Control:
• Complete / partial rerun
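The version diffs above can be made concrete with a small sketch. This is not part of any ReComp tool: the `diff` helper and the toy input dictionaries are illustrative assumptions, modelling each version as a key/value mapping.

```python
# Minimal sketch of the "measurable changes" idea: delta(v_t, v_t+1) as the
# keys added, removed, or changed between two versions of a mapping.
# All names here (diff, x_t, x_t1) are illustrative, not a ReComp API.

def diff(old: dict, new: dict) -> dict:
    """Compute delta(old, new): added, removed, and changed entries."""
    return {
        "added":   {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "changed": {k: (old[k], new[k])
                    for k in old.keys() & new.keys() if old[k] != new[k]},
    }

# Inputs X at two times t and t+1 (toy values)
x_t  = {"x11": "sample-A.vcf", "x12": "panel-v1"}
x_t1 = {"x11": "sample-A.vcf", "x12": "panel-v2"}

d = diff(x_t, x_t1)   # input diff delta(x_t, x_t+1)
print(d["changed"])   # {'x12': ('panel-v1', 'panel-v2')}
```

The same helper applies unchanged to dependency diffs δ(Dt, Dt+1) when a dependency is recorded as a mapping from record identifier to version.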
A history of runs
[Figure: two runs of process P; Run 1 (Patient A) with inputs x11, x12, dependencies D11, DCV, output y1; Run 2 (Patient B) with inputs x21, x22, dependencies D21, DCV, output y2]
H = {<x,D,P,y, prov(y), cost(y)>}
Note on dependencies Dij:
Fine-grained: Dij are the results of a query to a dependent data source (OMIM)
Coarse-grained: only record that (a specific version of) OMIM has been used
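The history tuple <x, D, P, y, prov(y), cost(y)> can be sketched as a plain record type. The `Run` class and its field values are hypothetical, chosen only to mirror the slide's notation; the coarse-grained option records just the dependency version, as noted above.

```python
from dataclasses import dataclass

# Illustrative record type for one entry of the run history
# H = {<x, D, P, y, prov(y), cost(y)>}. Names are assumptions, not ReComp API.

@dataclass
class Run:
    x: dict       # inputs, e.g. {"x11": "patientA.vcf"}
    D: dict       # coarse-grained dependency versions, e.g. {"OMIM": "2016-04"}
    P: str        # process identifier / version
    y: dict       # outputs
    prov: set     # provenance: dependency records y was derived from
    cost: float   # e.g. CPU-hours or cloud cost

H = []  # the history H, one Run per execution
H.append(Run(x={"x11": "patientA.vcf"}, D={"OMIM": "2016-04"},
             P="SVI-v1", y={"y1": "benign"},
             prov={("OMIM", "2016-04")}, cost=3.5))
print(len(H))  # 1
```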
ReComp questions
• Scoping:
Which patients (from a large cohort) are going to be affected by
change in input/reference data?
• Impact:
For each patient in scope, how likely is each patient’s diagnosis to
change?
Approach:
Given Dt+1 and changes δ(Dt, Dt+1):
For each patient X and outcome y, query prov(y) to find references
to Dt.
Patient X is in scope if prov(y) ∩ δ(Dt, Dt+1) is not empty
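The scoping rule stated above is a set intersection, and can be sketched directly. The representation of provenance entries and of the change set as sets of record identifiers (here, invented OMIM-style ids) is an assumption for illustration.

```python
# Sketch of the scoping rule: patient X is in scope iff
# prov(y) intersect delta(D_t, D_t+1) is non-empty.
# Record identifiers below are illustrative, not real OMIM content.

def in_scope(prov_y: set, delta_D: set) -> bool:
    """True if outcome y depends on at least one changed record."""
    return bool(prov_y & delta_D)

delta_D = {"OMIM:604370", "OMIM:113705"}   # records changed in D_t+1

patients = {
    "A": {"OMIM:604370", "OMIM:190070"},   # prov(y_A)
    "B": {"OMIM:602421"},                  # prov(y_B)
}
affected = [p for p, prov in patients.items() if in_scope(prov, delta_D)]
print(affected)  # ['A']
```

Only patient A is in scope, so only A's diagnosis is a candidate for refresh; B's run is skipped entirely.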
Missier, Paolo, Jacek Cala, and Eldarina Wijaya. “The Data, They Are a-Changin’.” In
Proc. TAPP’16 (Theory and Practice of Provenance), edited by Sarah Cohen-Boulakia.
Washington D.C., USA: USENIX Association, 2016. https://arxiv.org/abs/1604.06412.
Example: flood modelling in Newcastle

CityCAT (City Catchment Analysis Tool)
Inputs:
• Topography (DEMs from LIDAR)
• Physical structures (buildings etc.)
• Land use data
Outputs:
• High-resolution grid of flood depths

Observables:
• Inputs X = {xi1, xi2, …}
• Outputs Y = {yi1, yi2, …}
• Cost(y)
Measurable changes:
• Input diff: δ(xt, xt+1)
Control:
• Simulation rerun
• Grid resolution
• Regional boundaries
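One way to read this scenario is as a rerun-policy decision: re-run the simulation only when the observed input change justifies the compute cost. The following sketch is hypothetical; the change metric, benefit estimate, and thresholds are all assumptions, not CityCAT functionality.

```python
# Hypothetical rerun policy for an expensive simulation: re-run only when
# the estimated benefit of refreshing the output outweighs the cost.
# All parameters and units are illustrative assumptions.

def should_rerun(input_change: float, expected_benefit: float,
                 rerun_cost: float) -> bool:
    """Re-run iff inputs actually changed and benefit exceeds cost."""
    return input_change > 0 and expected_benefit > rerun_cost

# e.g. a new LIDAR DEM changed 12% of grid cells; refreshing flood depths
# is estimated to be worth 8 cost units against a 5-unit compute cost.
print(should_rerun(input_change=0.12, expected_benefit=8.0, rerun_cost=5.0))  # True
```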
Example: model learning pattern
Black box ReComp:
Observables:
• Outputs Y = {yi1, yi2, …}
• Cost(y) (retraining)
Measurable changes:
• Output quality relative to ground truth: qty(yt)
Control:
• Request to retrain
Black box ReComp – I
• When does the model require retraining?
• What is the expected cost and benefit of re-training the model at
any given time?
The Velox system / Berkeley AMP Lab:
Crankshaw, Daniel, Peter Bailis, Joseph E Gonzalez, Haoyuan Li, Zhao Zhang, Michael J Franklin, Ali
Ghodsi, and Michael I Jordan. “The Missing Piece in Complex Analytics: Low Latency, Scalable
Model Management and Serving with Velox.” In Procs CIDR 2015, Seventh Biennial Conference on
Innovative Data Systems Research, Asilomar, CA, USA.
http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper19u.pdf
Example: time series pattern
Black box ReComp: top-k most frequent NYC taxi routes over time

Observables:
• Outputs Y = {yi1, yi2, …}
• Cost(y)
Measurable changes:
• Output diff: δ(yt, yt+1)
Control:
• Sampling frequency / sampling density
Control question, under input data drift:
• How often and how densely should we sample from the stream to keep the output sufficiently current?
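A simple answer to that control question is an adaptive controller: back off the sampling rate while the top-k output is stable, sample more densely when it drifts. The sketch below is an assumption throughout; the drift metric, the halve/double policy, and the thresholds are invented for illustration, not the DEBS'15 solution.

```python
# Hypothetical adaptive sampling for the top-k stream pattern:
# small output diff delta(y_t, y_t+1) -> sample less often;
# large diff -> sample more densely. All parameters are toy values.

def topk_diff(prev: list, curr: list) -> float:
    """Fraction of positions on which two top-k lists disagree."""
    return sum(a != b for a, b in zip(prev, curr)) / max(len(curr), 1)

def next_sampling_rate(rate: float, drift: float,
                       lo: float = 0.05, hi: float = 1.0) -> float:
    """Halve the rate under low drift; double it under high drift."""
    if drift < 0.1:
        return max(lo, rate / 2)
    if drift > 0.3:
        return min(hi, rate * 2)
    return rate

prev_topk = ["JFK->Midtown", "SoHo->LES", "Chelsea->Harlem"]
curr_topk = ["JFK->Midtown", "SoHo->LES", "Chelsea->Harlem"]
drift = topk_diff(prev_topk, curr_topk)   # 0.0: the output is stable
print(next_sampling_rate(0.4, drift))     # 0.2: sample half as often
```

This is also where low-power edge devices come in: a stable output lets the edge node skip most of the stream at low risk.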
Example: graph repartitioning
• The Taper partitioner optimises for a given query workload
• The performance of a partitioning is well defined: # of inter-partition traversals (#ipt)
• Performance degrades when the query workload changes

[Plot: "Inter-partition traversals vs. % workload change"; traversals (0 to ~3M) grow as workload change increases from 0% to 120%]

Observables:
• Outputs Y = {yi1, yi2, …}
• Cost(y) (re-partition)
Measurable changes:
• Output quality: #ipt
Control:
• Re-partition requests
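The "system throughput vs cost of re-tuning" trade-off can be framed as a trigger: re-partition when the extra traversals expected before the next tuning opportunity outweigh the one-off re-partitioning cost. The function below is a sketch under assumptions; cost units, the horizon, and all figures are illustrative, not Taper's policy.

```python
# Hypothetical re-partitioning trigger: compare the projected cost of extra
# inter-partition traversals (#ipt) over a query horizon against the one-off
# cost of re-running the partitioner. All units and numbers are toy values.

def should_repartition(ipt_now: int, ipt_baseline: int,
                       traversal_cost: float, repartition_cost: float,
                       horizon_queries: int) -> bool:
    """Re-partition iff projected savings over the horizon exceed the cost."""
    extra_per_query = max(ipt_now - ipt_baseline, 0)
    projected_saving = extra_per_query * traversal_cost * horizon_queries
    return projected_saving > repartition_cost

# After a large workload change, #ipt per query rose from 0.5M to 2M.
print(should_repartition(ipt_now=2_000_000, ipt_baseline=500_000,
                         traversal_cost=1e-6, repartition_cost=1000.0,
                         horizon_queries=5000))  # True
```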
A summary of ReComp problems
Observables:
• Inputs X = {xi1, xi2, …}
• Outputs Y = {yi1, yi2, …}
• Dependencies D11, D12, ...
• Provenance prov(y)
• Cost(y)
• Process structure P = P1.P2…Pk
Measurable changes:
• Input diff: δ(xt, xt+1)
• Output diff: δ(yt, yt+1)
• Dependency diff: δ(Dt, Dt+1)
• Quality(y)
Control:
• Data selection
• Partial / complete rerun

Problems:
• Forwards (requires white box): scoping, i.e. identifying the affected population subset; change impact analysis over inputs, dependencies, and outputs. Sample problems: dataflow analytics, simulation, many-runs problems.
• Backwards (black box suffices): reacting to output instability / input drift. Sample problems: model learning, time series analytics, data-driven optimisation.
The next steps -- challenges
1. Optimisation:
Observables + control → a reactive system
+ cost and utility functions → optimisation problems
2. Learning from history:
Can we use history to learn estimates of impact without the need for
actual re-computation?
3. Software infrastructure and tooling:
ReComp is a metadata management and analytics exercise
4. Reproducibility:
What really happens when I press the “ReComp” button?
5. Impact:
How do we address the key impact areas?
• e-health
• genomics
• smart city management