ReComp@Scalable
Newcastle, May 9, 2016
Your data won’t stay smart forever:
exploring the temporal dimension of (big data) analytics
Paolo Missier, Jacek Cala, Manisha Rathi
Scalable Computing Group Seminar
Newcastle, May 2016
(*) Painting by Johannes Moreelse
Panta Rhei (Heraclitus)
Data to Knowledge
The Data-to-knowledge axiom of the Knowledge Economy:
[Diagram: Big Data → The Big Analytics Machine → “Valuable Knowledge”. Meta-knowledge feeding the machine: Algorithms, Tools, Middleware, Reference datasets]
The missing element: time
[Diagram: Big Data → The Big Analytics Machine → “Valuable Knowledge”, now with time axes (t) and successive versions V1, V2, V3. Meta-knowledge: Algorithms, Tools, Middleware, Reference datasets]
The ReComp decision support system
Observe change
• In big data
• In meta-knowledge
Assess and measure
• Knowledge decay
Estimate
• Cost and benefits of refresh
Enact
• Reproduce (analytics) processes
ReComp
[Diagram: the ReComp Decision Support System. Labels: Change Events; Diff(.,.) functions; “business Rules”; Previously Computed KAs and their metadata; Prioritised KAs; Cost estimates; Reproducibility assessment. Steps: Observe change, Assess and measure, Estimate, Enact]
KA: Knowledge Assets
ReComp scenarios
| ReComp scenario | Target impact areas | Why is ReComp relevant? | Proof of concept experiments | Expected optimisation |
|---|---|---|---|---|
| Dataflow, experimental science | Genomics | Rapid knowledge advances; rapid scaling up of genetic testing at population level | WES/SVI pipeline, workflow implementation (eScience Central) | Timeliness and accuracy of patient diagnosis subject to budget constraints |
| Time series analysis | Personal health monitoring; smart city analytics; IoT data streams | Rapid data drift; cost of computation at network edge (eg IoT) | NYC taxi rides challenge (DEBS’15) | Use of low-power edge devices when outcome is predictable and data drift is low |
| Data layer optimisation | Tuning of large-scale data management stack | Optimal data organisation sensitive to current data profiles | Graph DB re-partitioning | System throughput vs cost of re-tuning |
| Model learning | Applications of predictive analytics | Predictive models are very sensitive to data drift | Twitter content analysis | Sustained model predictive power over time vs retraining cost |
| Simulation | TBD | Repeated simulation: computationally expensive but often not beneficial | Flood modelling / CityCAT Newcastle | Computational resources vs marginal benefit of new simulation model |
Data-intensive systems: two properties
1. Observability (transparency)
How much of a data-intensive system can we observe?
• structure + data flow
2. Control
How much control do we have over the system?
• Execution: frequency, total or partial
• Input density
Observability / transparency
| | White box | Black box |
|---|---|---|
| Structure (static view) | Dataflow: eScience Central, Taverna, VisTrails…; Scripting: R, Matlab, Python... | Functions semantics; packaged components; third party services |
| Data dependencies (runtime view) | Provenance recording: inputs, reference datasets, component versions, outputs | Input; outputs; no data dependencies; no details on individual components |
| Cost | Detailed resource monitoring; Cloud → £££ | Wall clock time; service pricing; setup time (eg model learning) |
Example: genomics / variant interpretation
What changes:
- Patient variants: improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
White box ReComp
[Diagram: process P takes inputs x11, x12 and dependencies D11, D12, and produces output y11]
For each run i:
Observables:
Inputs X = {xi1, xi2, …}
Outputs y = {yi1, yi2, …}
Dependencies D11, D12, ...
Provenance prov(y)
Cost(y)
Process structure P = P1.P2…Pk
Measurable changes:
Input diff: δ(xt, xt+1)
Output diff: δ(yt, yt+1)
Dependency diff: δ(Dt, Dt+1)
Control:
Complete / partial rerun
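The diff functions δ(., .) listed above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each version of an input or reference dataset is a dictionary from record identifier to value; the `Diff` type, the `diff` and `touched` helpers and the sample records are hypothetical and not part of ReComp.

```python
from typing import Any, Dict, Hashable, NamedTuple, Set


class Diff(NamedTuple):
    """Record identifiers that differ between two versions of a dataset."""
    added: Set[Hashable]
    removed: Set[Hashable]
    changed: Set[Hashable]


def diff(old: Dict[Hashable, Any], new: Dict[Hashable, Any]) -> Diff:
    """Compute delta(old, new) at record level."""
    old_keys, new_keys = set(old), set(new)
    return Diff(
        added=new_keys - old_keys,
        removed=old_keys - new_keys,
        changed={k for k in old_keys & new_keys if old[k] != new[k]},
    )


def touched(d: Diff) -> Set[Hashable]:
    """All record identifiers affected by the change, in any way."""
    return d.added | d.removed | d.changed


# Hypothetical reference dataset at times t and t+1.
d_t = {"var1": "benign", "var2": "uncertain", "var3": "pathogenic"}
d_t1 = {"var1": "benign", "var2": "pathogenic", "var4": "uncertain"}

delta = diff(d_t, d_t1)
print(delta)
```

The same function could serve for input, output and dependency diffs as long as the data can be keyed by record; richer data types would need their own notion of difference.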
A history of runs
Run 1, Patient A: [Diagram: process P takes inputs x11, x12 and dependencies D11, DCV, and produces output y1]
Run 2, Patient B: [Diagram: process P takes inputs x21, x22 and dependencies D21, DCV, and produces output y2]
H = {<x, D, P, y, prov(y), cost(y)>}
Note on dependencies Dij:
Fine-grained: Dij are the results of a query to a dependent data source (OMIM)
Coarse-grained: only record that (a specific version of) OMIM has been used
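One possible way to hold the history H in memory is sketched below. The `RunRecord` fields mirror the tuple <x, D, P, y, prov(y), cost(y)>; the coarse-grained dependency encoding (source name mapped to a version label), the version labels, the record identifiers and the `runs_using` helper are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class RunRecord:
    """One element <x, D, P, y, prov(y), cost(y)> of the history H."""
    x: List[str]          # input identifiers
    D: Dict[str, str]     # coarse-grained dependencies: source -> version used
    P: str                # process identifier
    y: str                # output identifier
    prov: Set[str]        # fine-grained: reference records the output depends on
    cost: float           # cost of producing y (unit left unspecified)


def runs_using(history: List[RunRecord], source: str, version: str) -> List[RunRecord]:
    """Coarse-grained lookup: runs that used a given version of a data source."""
    return [r for r in history if r.D.get(source) == version]


# Hypothetical history with two runs of the same process P.
history = [
    RunRecord(x=["x11", "x12"], D={"OMIM": "v1", "ClinVar": "v1"},
              P="P", y="y1", prov={"OMIM:rec-a"}, cost=12.0),
    RunRecord(x=["x21", "x22"], D={"OMIM": "v2", "ClinVar": "v1"},
              P="P", y="y2", prov={"OMIM:rec-b"}, cost=9.5),
]

affected = runs_using(history, "ClinVar", "v1")
print([r.y for r in affected])
```

With coarse-grained records only, every run that used the old version is a candidate for re-computation; the fine-grained `prov` field is what allows the candidate set to be narrowed.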
ReComp questions
• Scoping:
Which patients (from a large cohort) are going to be affected by a change in input/reference data?
• Impact:
For each patient in scope, how likely is the patient’s diagnosis to change?
Approach:
Given Dt+1 and changes δ(Dt, Dt+1):
For each patient X and outcome y, query prov(y) to find references to Dt.
Patient X is in scope if prov(y) ∩ δ(Dt, Dt+1) is not empty.
Missier, Paolo, Jacek Cala, and Eldarina Wijaya. “The Data, They Are a-Changin’.” In Proc. TAPP’16 (Theory and Practice of Provenance), edited by Sarah Cohen-Boulakia. Washington D.C., USA: USENIX Association, 2016. https://arxiv.org/abs/1604.06412.
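The scoping test prov(y) ∩ δ(Dt, Dt+1) ≠ ∅ can be sketched as a set intersection. The sketch assumes that provenance has already been reduced to the set of reference record identifiers each outcome depends on, and that the diff is the set of record identifiers that changed; the patient names and record identifiers are invented for the example.

```python
from typing import Dict, Set


def patients_in_scope(prov: Dict[str, Set[str]], delta: Set[str]) -> Dict[str, Set[str]]:
    """Return the patients whose outcome used at least one changed record,
    together with the changed records that put them in scope."""
    scope = {}
    for patient, used in prov.items():
        hit = used & delta          # prov(y) intersected with the diff
        if hit:
            scope[patient] = hit
    return scope


# Hypothetical provenance: reference records used for each patient's outcome.
prov = {
    "patient_A": {"rec1", "rec2"},
    "patient_B": {"rec3"},
    "patient_C": {"rec2", "rec4"},
}
# Hypothetical diff: records that changed between Dt and Dt+1.
delta = {"rec2", "rec5"}

scope = patients_in_scope(prov, delta)
print(sorted(scope))
```

Scoping in this form says only which patients may be affected; estimating how likely each diagnosis is to change is the separate impact question.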
Example: flood modelling in Newcastle
CityCAT (City Catchment Analysis Tool)
Inputs
• Topography (DEMs from LIDAR)
• Physical structures (buildings etc.)
• Land use data
Outputs
• High resolution grid of flood depths
Observables:
Inputs X = {xi1, xi2, …}
Outputs y = {yi1, yi2, …}
Cost(y)
Measurable changes:
Input diff: δ(xt, xt+1)
Control:
Simulation rerun
Grid resolution
Regional boundaries
Example: model learning pattern
Black box ReComp:
Observables:
Outputs y = {yi1, yi2, …}
Cost(y) (retraining)
Measurable changes:
Output quality relative to ground truth: qty(yt)
Control:
Request to retrain
Black box ReComp – I
• When does the model require retraining?
• What is the expected cost and benefit of re-training the model at any given time?
The Velox system / Berkeley AMP Lab:
Crankshaw, Daniel, Peter Bailis, Joseph E. Gonzalez, Haoyuan Li, Zhao Zhang, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox.” In Proc. CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA. http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper19u.pdf.
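The two questions above can be made concrete with a small sketch of a retraining trigger. This is not the Velox method; it is an illustrative heuristic that assumes output quality is accuracy against ground truth over a sliding window, and that the benefit of retraining is proportional to the quality lost since a baseline. The class name, parameters and numbers are all hypothetical.

```python
from collections import deque


class RetrainMonitor:
    """Track recent output quality and compare the estimated benefit of
    retraining with its cost (both in the same, arbitrary unit)."""

    def __init__(self, baseline_quality, retrain_cost, value_per_quality_unit, window=100):
        self.baseline = baseline_quality
        self.retrain_cost = retrain_cost
        self.value = value_per_quality_unit
        self.outcomes = deque(maxlen=window)   # True where prediction matched truth

    def record(self, prediction, truth):
        self.outcomes.append(prediction == truth)

    def current_quality(self):
        if not self.outcomes:
            return self.baseline               # no evidence of decay yet
        return sum(self.outcomes) / len(self.outcomes)

    def expected_benefit(self):
        # Assumption: retraining restores the baseline quality.
        return max(0.0, self.baseline - self.current_quality()) * self.value

    def should_retrain(self):
        return self.expected_benefit() > self.retrain_cost


monitor = RetrainMonitor(baseline_quality=0.9, retrain_cost=5.0,
                         value_per_quality_unit=100.0, window=10)
# Ten recent predictions, only five of which match the ground truth.
for i in range(10):
    monitor.record(prediction=1, truth=1 if i < 5 else 0)

print(monitor.current_quality(), monitor.should_retrain())
```

In practice the benefit estimate is the hard part: ground truth may arrive late or not at all, and retraining need not restore the baseline.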
Example: time series pattern
Black box ReComp – top-k most frequent NYC taxi routes over time
Control input data drift:
• How often and how densely should we sample from the stream to keep the output sufficiently current?
Observables:
Outputs y = {yi1, yi2, …}
Cost(y)
Measurable changes:
Output diff: δ(yt, yt+1)
Control:
Sample frequency / Sample density
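A minimal sketch of the output diff for this pattern follows. It assumes each ride is reduced to a (pickup, dropoff) pair, that the output y is the list of the k most frequent routes in a window, and that δ(yt, yt+1) is measured as the Jaccard distance between consecutive top-k sets; the distance measure and the sample data are choices made for this example.

```python
from collections import Counter


def top_k_routes(rides, k):
    """Return the k most frequent (pickup, dropoff) routes in a window."""
    return [route for route, _ in Counter(rides).most_common(k)]


def output_diff(y_t, y_t1):
    """Jaccard distance between two top-k outputs: 0.0 means identical sets."""
    a, b = set(y_t), set(y_t1)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sample(rides, every):
    """Control knob: keep one ride in every `every` (sample density)."""
    return rides[::every]


# Hypothetical windows of rides between zones A, B and C.
window_t = [("A", "B")] * 3 + [("A", "C")] * 2 + [("B", "C")]
window_t1 = [("A", "B")] * 3 + [("B", "C")] * 2 + [("A", "C")]

y_t = top_k_routes(window_t, k=2)
y_t1 = top_k_routes(window_t1, k=2)
print(y_t, y_t1, output_diff(y_t, y_t1))
```

When the diff stays near zero across windows, a lower sampling frequency or density may be enough to keep the output current; a rising diff suggests the opposite.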
Example: graph repartitioning
• Taper Partitioner optimises for a given query workload
• Performance of a partitioning is well defined
  • # inter-partition traversals
• Performance degrades when query workload changes
[Figure: Inter-partition traversals vs. % workload change. x-axis: % workload change (0% to 120%); y-axis: inter-partition traversals (0 to 3,000,000)]
Observables:
Outputs y = {yi1, yi2, …}
Cost(y) (re-partition)
Measurable changes:
Output quality: #ipt
Control:
Re-partition requests
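The quality measure #ipt can be sketched as a count of edge traversals that cross a partition boundary. This is not the Taper algorithm, only the metric: the sketch assumes a query workload is a list of traversal paths over vertices and that a partitioning maps each vertex to a partition number; the graph and workloads are invented.

```python
def inter_partition_traversals(workload, partition_of):
    """Count steps in the workload whose endpoints lie in different partitions."""
    return sum(
        1
        for path in workload
        for u, v in zip(path, path[1:])
        if partition_of[u] != partition_of[v]
    )


# Hypothetical partitioning of four vertices into two partitions.
partition_of = {"a": 0, "b": 0, "c": 1, "d": 1}

# The workload the partitioning was tuned for stays inside partitions...
old_workload = [["a", "b"], ["c", "d"]]
# ...while a changed workload crosses the boundary on every step.
new_workload = [["a", "c", "b"], ["b", "d"]]

ipt_old = inter_partition_traversals(old_workload, partition_of)
ipt_new = inter_partition_traversals(new_workload, partition_of)
print(ipt_old, ipt_new)
```

Tracking this count as the workload drifts gives the signal against which the cost of a re-partition request can be weighed.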
A summary of ReComp problems
Observables:
Inputs X = {xi1, xi2, …}
Outputs y = {yi1, yi2, …}
Dependencies D11, D12, ...
Provenance prov(y)
Cost(y)
Process structure P = P1.P2…Pk
Measurable changes:
Input diff: δ(xt, xt+1)
Output diff: δ(yt, yt+1)
Dependency diff: δ(Dt, Dt+1)
Quality(y)
Control:
- Data selection
- Partial / complete rerun
| Problem | Requires | Sample problems |
|---|---|---|
| Forwards: scoping (identify affected population subset); change impact analysis (inputs, dependencies → outputs) | White box | Dataflow analytics; simulation; many-runs problems |
| Backwards: react to output instability / input drift | Black box | Model learning; time series analytics; data-driven optimisation |
The next steps -- challenges
1. Optimisation:
Observables + control → reactive system
+ cost and utility functions → optimisation problems
2. Learning from history:
Can we use history to learn estimates of impact without the need for actual re-computation?
3. Software infrastructure and tooling:
ReComp is a metadata management and analytics exercise
4. Reproducibility:
What really happens when I press the “ReComp” button?
5. Impact:
How do we address key impact areas?
- e-health
- Genomics
- Smart city management
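As an illustration of challenge 1, the step from cost and utility functions to an optimisation problem can be sketched as a budgeted selection of knowledge assets to re-compute. The greedy benefit-per-cost rule below is a simple heuristic chosen for the example and is not guaranteed to be optimal; the candidate names, benefit estimates and costs are hypothetical, and costs are assumed to be strictly positive.

```python
def select_recomputations(candidates, budget):
    """Greedily pick re-computations with the best estimated benefit per unit
    cost until the budget is exhausted.

    candidates: list of (name, estimated_benefit, cost) tuples, cost > 0.
    Returns the chosen names in the order they were selected.
    """
    chosen = []
    remaining = budget
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for name, _benefit, cost in ranked:
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen


# Hypothetical knowledge assets with estimated benefit and re-computation cost.
candidates = [("KA-A", 10.0, 5.0), ("KA-B", 6.0, 2.0), ("KA-C", 4.0, 4.0)]

plan = select_recomputations(candidates, budget=7.0)
print(plan)
```

The quality of any such plan depends on the benefit estimates, which is where learning from the history of past runs (challenge 2) would come in.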

More Related Content

What's hot

ReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyPaolo Missier
 
Data Trajectories: tracking the reuse of published data for transitive credi...
Data Trajectories: tracking the reuse of published datafor transitive credi...Data Trajectories: tracking the reuse of published datafor transitive credi...
Data Trajectories: tracking the reuse of published data for transitive credi...Paolo Missier
 
Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...Paolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Preserving the currency of analytics outcomes over time through selective re-...
Preserving the currency of analytics outcomes over time through selective re-...Preserving the currency of analytics outcomes over time through selective re-...
Preserving the currency of analytics outcomes over time through selective re-...Paolo Missier
 
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big DataPradeeban Kathiravelu, Ph.D.
 
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...Ioannis Katakis
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introductionNeeraj Tewari
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...Till Blume
 
Networking Materials Data
Networking Materials DataNetworking Materials Data
Networking Materials DataIan Foster
 
A framework for mining signatures from event sequences and its applications i...
A framework for mining signatures from event sequences and its applications i...A framework for mining signatures from event sequences and its applications i...
A framework for mining signatures from event sequences and its applications i...JPINFOTECH JAYAPRAKASH
 
DataXDay - Exploring graphs: looking for communities & leaders
DataXDay - Exploring graphs: looking for communities & leadersDataXDay - Exploring graphs: looking for communities & leaders
DataXDay - Exploring graphs: looking for communities & leadersDataXDay Conference by Xebia
 
Towards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataTowards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataPablo Bernabeu
 

What's hot (19)

ReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case Study
 
Data Trajectories: tracking the reuse of published data for transitive credi...
Data Trajectories: tracking the reuse of published datafor transitive credi...Data Trajectories: tracking the reuse of published datafor transitive credi...
Data Trajectories: tracking the reuse of published data for transitive credi...
 
Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Preserving the currency of analytics outcomes over time through selective re-...
Preserving the currency of analytics outcomes over time through selective re-...Preserving the currency of analytics outcomes over time through selective re-...
Preserving the currency of analytics outcomes over time through selective re-...
 
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
 
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
 
FDSE2015
FDSE2015FDSE2015
FDSE2015
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introduction
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
Networking Materials Data
Networking Materials DataNetworking Materials Data
Networking Materials Data
 
A framework for mining signatures from event sequences and its applications i...
A framework for mining signatures from event sequences and its applications i...A framework for mining signatures from event sequences and its applications i...
A framework for mining signatures from event sequences and its applications i...
 
DataXDay - Exploring graphs: looking for communities & leaders
DataXDay - Exploring graphs: looking for communities & leadersDataXDay - Exploring graphs: looking for communities & leaders
DataXDay - Exploring graphs: looking for communities & leaders
 
Big data
Big dataBig data
Big data
 
Skillwise Big data
Skillwise Big dataSkillwise Big data
Skillwise Big data
 
Towards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataTowards reproducibility and maximally-open data
Towards reproducibility and maximally-open data
 

Similar to Your data won’t stay smart forever: exploring the temporal dimension of (big data) analytics

Probabilistic Modelling with Information Filtering Networks
Probabilistic Modelling with Information Filtering NetworksProbabilistic Modelling with Information Filtering Networks
Probabilistic Modelling with Information Filtering NetworksTomaso Aste
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsJason Riedy
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...DataStax Academy
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Ian Foster
 
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...Safe Software
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? Robert Grossman
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streamsKrish_ver2
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarinn5712036
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 
Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIPaul Groth
 

Similar to Your data won’t stay smart forever: exploring the temporal dimension of (big data) analytics (20)

Probabilistic Modelling with Information Filtering Networks
Probabilistic Modelling with Information Filtering NetworksProbabilistic Modelling with Information Filtering Networks
Probabilistic Modelling with Information Filtering Networks
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
Srikanta Mishra
Srikanta MishraSrikanta Mishra
Srikanta Mishra
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
Real Time Geodemographics
Real Time GeodemographicsReal Time Geodemographics
Real Time Geodemographics
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
io-Chem-BD, una solució per gestionar el Big Data en Química Computacional
io-Chem-BD, una solució per gestionar el Big Data en Química Computacionalio-Chem-BD, una solució per gestionar el Big Data en Química Computacional
io-Chem-BD, una solució per gestionar el Big Data en Química Computacional
 
rerngvit_phd_seminar
rerngvit_phd_seminarrerngvit_phd_seminar
rerngvit_phd_seminar
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AI
 

More from Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Your data won’t stay smart forever: exploring the temporal dimension of (big data) analytics

  • 1. ReComp@Scalable, Newcastle, May 9, 2016. Your data won’t stay smart forever: exploring the temporal dimension of (big data) analytics. Paolo Missier, Jacek Cala, Manisha Rathi. Scalable Computing Group Seminar, Newcastle, May 2016. Panta Rhei (Heraclitus) (*). (*) Painting by Johannes Moreelse.
  • 2. Data to Knowledge
      The Data-to-knowledge axiom of the Knowledge Economy:
      [Diagram: Big Data and Meta-knowledge (algorithms, tools, middleware, reference datasets) feed The Big Analytics Machine, which produces “Valuable Knowledge”.]
  • 3. The missing element: time
      [Diagram: as on slide 2, annotated with versions V1, V2, V3 and three time axes t.]
  • 4. The ReComp decision support system
      - Observe change: in big data; in meta-knowledge
      - Assess and measure: knowledge decay
      - Estimate: cost and benefits of refresh
      - Enact: reproduce (analytics) processes
      [Diagram: as on slide 3.]
  • 6. ReComp scenarios
      Scenario: Dataflow, experimental science
        Target impact areas: Genomics
        Why is ReComp relevant? Rapid knowledge advances; rapid scaling up of genetic testing at population level
        Proof of concept experiments: WES/SVI pipeline, workflow implementation (eScience Central)
        Expected optimisation: Timeliness and accuracy of patient diagnosis subject to budget constraints
      Scenario: Time series analysis
        Target impact areas: Personal health monitoring; smart city analytics; IoT data streams
        Why is ReComp relevant? Rapid data drift; cost of computation at network edge (e.g. IoT)
        Proof of concept experiments: NYC taxi rides challenge (DEBS’15)
        Expected optimisation: Use of low-power edge devices when outcome is predictable and data drift is low
      Scenario: Data layer optimisation
        Target impact areas: Tuning of large-scale data management stack
        Why is ReComp relevant? Optimal data organisation is sensitive to current data profiles
        Proof of concept experiments: Graph DB re-partitioning
        Expected optimisation: System throughput vs cost of re-tuning
      Scenario: Model learning
        Target impact areas: Applications of predictive analytics
        Why is ReComp relevant? Predictive models are very sensitive to data drift
        Proof of concept experiments: Twitter content analysis
        Expected optimisation: Sustained model predictive power over time vs retraining cost
      Scenario: Simulation
        Target impact areas: TBD
        Why is ReComp relevant? Repeated simulation is computationally expensive but often not beneficial
        Proof of concept experiments: Flood modelling / CityCat Newcastle
        Expected optimisation: Computational resources vs marginal benefit of new simulation model
  • 7. Data-intensive systems: two properties
      1. Observability (transparency): How much of a data-intensive system can we observe?
        - Structure + data flow
      2. Control: How much control do we have on the system?
        - Execution frequency, total, partial
        - Input density
  • 8. Observability / transparency
      Structure (static view)
        White box: Dataflow (eScience Central, Taverna, VisTrails…); scripting (R, Matlab, Python...)
        Black box: Function semantics; packaged components; third party services
      Data dependencies (runtime view)
        White box: Provenance recording of inputs, reference datasets, component versions, outputs
        Black box: Input; outputs; no data dependencies; no details on individual components
      Cost
        White box: Detailed resource monitoring; Cloud → £££
        Black box: Wall clock time; service pricing; setup time (e.g. model learning)
  • 9. Example: genomics / variant interpretation
      What changes:
        - Patient variants → improved sequencing / variant calling
        - ClinVar, OMIM evolve rapidly
        - New reference data sources
  • 10. White box ReComp
      [Diagram: process P with inputs x11, x12, dependencies D11, D12, and output y11.]
      For each run i:
      Observables:
        - Inputs X = {x_{i1}, x_{i2}, …}
        - Outputs y = {y_{i1}, y_{i2}, …}
        - Dependencies D_{11}, D_{12}, ...
        - Provenance prov(y)
        - Cost(y)
        - Process structure P = P1.P2…Pk
      Measurable changes:
        - Input diff: δ(x_t, x_{t+1})
        - Output diff: δ(y_t, y_{t+1})
        - Dependency diff: δ(D_t, D_{t+1})
      Control: Complete / partial rerun
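The dependency diff δ(D_t, D_{t+1}) listed on this slide can be illustrated with a comparison of two versions of a reference dataset. The sketch below assumes each version is a dictionary mapping a record key to its content; the function name `diff_versions` and the sample records are hypothetical and not taken from the talk.

```python
def diff_versions(old, new):
    """Return the keys added, removed, or changed between two dataset versions."""
    added = {k for k in new if k not in old}
    removed = {k for k in old if k not in new}
    changed = {k for k in old if k in new and old[k] != new[k]}
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical reference data at time t and t+1 (illustrative values only).
d_t = {"var1": "benign", "var2": "uncertain", "var3": "pathogenic"}
d_t1 = {"var1": "benign", "var2": "pathogenic", "var4": "uncertain"}

delta = diff_versions(d_t, d_t1)
```

The same shape of function could serve for input and output diffs, provided inputs and outputs can be keyed in a comparable way.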
  • 11. A history of runs
      [Diagram: Run 1, Patient A: inputs x11, x12, dependencies D11, DCV, process P, output y1. Run 2, Patient B: inputs x21, x22, dependencies D21, DCV, process P, output y2.]
      H = {<x, D, P, y, prov(y), cost(y)>}
      Note on dependencies D_{ij}:
        - Fine-grained: D_{ij} are the results of a query to a dependent data source (OMIM)
        - Coarse-grained: only record that (a specific version of) OMIM has been used
  • 12. ReComp questions
      - Scoping: Which patients (from a large cohort) are going to be affected by a change in input/reference data?
      - Impact: For each patient in scope, how likely is the diagnosis to change?
      Approach: Given D_{t+1} and changes δ(D_t, D_{t+1}): for each patient X and outcome y, query prov(y) to find references to D_t. Patient X is in scope if prov(y) ∩ δ(D_t, D_{t+1}) is not empty.
      Missier, Paolo, Jacek Cala, and Eldarina Wijaya. “The Data, They Are a-Changin’.” In Proc. TAPP’16 (Theory and Practice of Provenance), edited by Sarah Cohen-Boulakia. Washington D.C., USA: USENIX Association, 2016. https://arxiv.org/abs/1604.06412.
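The scoping rule above (a patient is in scope if prov(y) ∩ δ(D_t, D_{t+1}) is not empty) can be sketched as a set intersection over a history of runs. This is a minimal illustration, assuming that the provenance of each outcome has already been reduced to the set of reference-data record keys it used; the `Run` class, the function `patients_in_scope`, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    patient: str
    outcome: str
    # Keys of the records in D_t that prov(y) refers to (assumed precomputed).
    prov_refs: set = field(default_factory=set)


def patients_in_scope(history, changed_records):
    """Patients whose outcome provenance touches at least one changed record."""
    return sorted({run.patient for run in history
                   if run.prov_refs & changed_records})


# Hypothetical history H and change set delta(D_t, D_{t+1}).
history = [
    Run("A", "diagnosis-1", {"var1", "var2"}),
    Run("B", "diagnosis-2", {"var5"}),
]
changed = {"var2", "var3", "var4"}

in_scope = patients_in_scope(history, changed)
```

Only patient A is selected here, because only that run's provenance refers to a changed record. This relies on fine-grained dependencies; with coarse-grained recording (only the dataset version), every run that used the changed source would be in scope.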
  • 13. Example: flood modelling in Newcastle
      CityCAT (City Catchment Analysis Tool)
        Inputs: topography (DEMs from LIDAR); physical structures (buildings etc.); land use data
        Outputs: high resolution grid of flood depths
      Observables: Inputs X = {x_{i1}, x_{i2}, …}; Outputs y = {y_{i1}, y_{i2}, …}; Cost(y)
      Measurable changes: Input diff: δ(x_t, x_{t+1})
      Control: Simulation rerun; grid resolution; regional boundaries
  • 14. Example: model learning pattern
      Black box ReComp:
      Observables: Outputs y = {y_{i1}, y_{i2}, …}; Cost(y) (retraining)
      Measurable changes: Output quality relative to ground truth: qty(y_t)
      Control: Request to retrain
  • 15. Black box ReComp – I
      - When does the model require retraining?
      - What is the expected cost and benefit of re-training the model at any given time?
      The Velox system / Berkeley AMP Lab: Crankshaw, Daniel, Peter Bailis, Joseph E Gonzalez, Haoyuan Li, Zhao Zhang, Michael J Franklin, Ali Ghodsi, and Michael I Jordan. “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox.” In Procs CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA. http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper19u.pdf.
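One simple way to frame the cost/benefit question on this slide is a threshold rule: retrain when the estimated value of recovering lost output quality exceeds the retraining cost. The sketch below is an assumption made for illustration only; it is not the Velox approach, and the function `should_retrain`, its parameters, and the linear value model are all hypothetical.

```python
def should_retrain(baseline_quality, current_quality, value_per_point, retrain_cost):
    """Retrain when the value of the quality lost since training exceeds the cost.

    Quality is assumed to be measured against ground truth on a 0..1 scale;
    value_per_point converts one full unit of quality into the cost currency.
    """
    decay = max(0.0, baseline_quality - current_quality)
    expected_benefit = decay * value_per_point
    return expected_benefit > retrain_cost


# Hypothetical figures: quality dropped from 0.90 to 0.80, retraining costs 50.
retrain_now = should_retrain(0.90, 0.80, value_per_point=1000.0, retrain_cost=50.0)
# A smaller drop (0.90 to 0.88) does not justify the same cost.
retrain_later = should_retrain(0.90, 0.88, value_per_point=1000.0, retrain_cost=50.0)
```

In practice both the benefit and the cost would have to be estimated, which is the harder part of the problem the slide raises.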
  • 16. Example: time series pattern
      Black box ReComp: top-k most frequent NYC taxi routes over time
      Control input data drift:
        - How often and how densely should we sample from the stream to keep the output sufficiently current?
      Observables: Outputs y = {y_{i1}, y_{i2}, …}; Cost(y)
      Measurable changes: Output diff: δ(y_t, y_{t+1})
      Control: Sample frequency / sample density
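The interplay between sample density and output diff in the top-k routes example can be sketched as follows. This is a toy illustration, assuming rides are (pickup, dropoff) pairs held in a list; the functions `top_k_routes` and `output_diff`, the `sample_every` parameter, and the data are hypothetical and unrelated to the actual DEBS'15 dataset.

```python
from collections import Counter


def top_k_routes(rides, k, sample_every=1):
    """Top-k most frequent routes, counting only every `sample_every`-th ride."""
    sampled = rides[::sample_every]
    return [route for route, _ in Counter(sampled).most_common(k)]


def output_diff(old_top, new_top):
    """Routes that entered or left the top-k set between two time windows."""
    return set(old_top) ^ set(new_top)


# Two hypothetical time windows of rides.
rides_t = [("A", "B")] * 6 + [("C", "D")] * 3 + [("E", "F")] * 1
rides_t1 = [("A", "B")] * 5 + [("E", "F")] * 4 + [("C", "D")] * 1

top_t = top_k_routes(rides_t, k=2)
top_t1 = top_k_routes(rides_t1, k=2)
drift = output_diff(top_t, top_t1)

# Halving the sample density leaves the top-2 for window t unchanged here.
top_t_sparse = top_k_routes(rides_t, k=2, sample_every=2)
```

When a sparser sample yields the same top-k as the full window, the cheaper computation is sufficient; a non-empty output diff between windows signals that the result is going stale.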
  • 17. Example: graph repartitioning
      - Taper Partitioner optimises for a given query workload
      - Performance of a partitioning is well defined
        - # inter-partition traversals
      - Performance degrades when query workload changes
      [Plot: Inter-partition traversals vs. % workload change; x-axis: % workload change (0% to 120%); y-axis: inter-partition traversals (0 to 3,000,000).]
      Observables: Outputs y = {y_{i1}, y_{i2}, …}; Cost(y) (re-partition)
      Measurable changes: Output quality: #ipt
      Control: Re-partition requests
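The quality measure on this slide, the number of inter-partition traversals (#ipt), can be sketched as a count over the edges a query workload traverses. The snippet assumes a partitioning is a vertex-to-partition map and a workload is a list of traversed edges; the function name and the tiny graph are hypothetical and do not reflect how Taper itself measures or optimises this.

```python
def inter_partition_traversals(partition_of, traversals):
    """Count traversed edges whose endpoints lie in different partitions."""
    return sum(1 for u, v in traversals if partition_of[u] != partition_of[v])


# Hypothetical partitioning tuned for the old workload.
partition_of = {"a": 0, "b": 0, "c": 1, "d": 1}

workload_old = [("a", "b"), ("c", "d"), ("a", "b"), ("b", "c")]
workload_new = [("a", "c"), ("b", "d"), ("a", "d"), ("c", "d")]

ipt_old = inter_partition_traversals(partition_of, workload_old)
ipt_new = inter_partition_traversals(partition_of, workload_new)
```

The count rises when the workload shifts away from the one the partitioning was tuned for, which is the degradation the plot describes; a re-partition request would be weighed against that rise.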
  • 18. A summary of ReComp problems
      Observables: Inputs X = {x_{i1}, x_{i2}, …}; Outputs y = {y_{i1}, y_{i2}, …}; Dependencies D_{11}, D_{12}, ...; Provenance prov(y); Cost(y); Process structure P = P1.P2…Pk
      Measurable changes: Input diff: δ(x_t, x_{t+1}); Output diff: δ(y_t, y_{t+1}); Dependency diff: δ(D_t, D_{t+1}); Quality(y)
      Control: Data selection; partial / complete rerun
      Problem: Forwards
        - Scoping: identify affected population subset
        - Change impact analysis: inputs, dependencies → outputs
        Requires: White box
        Sample problems: Dataflow analytics; simulation; many-runs problems
      Problem: Backwards
        - React to output instability / input drift
        Requires: Black box
        Sample problems: Model learning; time series analytics; data-driven optimisation
  • 19. The next steps: challenges
      1. Optimisation: Observables + control → reactive system; + cost and utility functions → optimisation problems
      2. Learning from history: Can we use history to learn estimates of impact without the need for actual re-computation?
      3. Software infrastructure and tooling: ReComp is a metadata management and analytics exercise
      4. Reproducibility: What really happens when I press the “ReComp” button?
      5. Impact: How do we address key impact areas?
        - e-health
        - Genomics
        - Smart city management

Editor's Notes

  1. The times they are a’changin