Your data won’t stay smart forever:
exploring the temporal dimension of (big data) analytics
Paolo Missier, Jacek Cala, Manisha Rathi
Scalable Computing Group Seminar
Newcastle, 9 May 2016
Panta Rhei (Heraclitus)
(*) Painting by Johannes Moreelse
Data to Knowledge
The Data-to-knowledge axiom of the Knowledge Economy:
[Diagram: Big Data, together with meta-knowledge (algorithms, tools, middleware, reference datasets), feeds the Big Analytics Machine, which produces "Valuable Knowledge"]
The missing element: time
[Diagram: the same Big Data / meta-knowledge / Big Analytics Machine pipeline, but every element now evolves over time t, and "Valuable Knowledge" comes in successive versions V1, V2, V3]
The ReComp decision support system
Observe change
• In big data
• In meta-knowledge
Assess and measure
• Knowledge decay
Estimate
• Cost and benefits of refresh
Enact
• Reproduce (analytics) processes
[Diagram: as on the previous slide, Big Data and meta-knowledge (algorithms, tools, middleware, reference datasets) evolve over time t, yielding versions V1, V2, V3 of "Valuable Knowledge"]
ReComp
[Diagram: the ReComp Decision Support System takes as inputs change events, diff(.,.) functions, "business rules", and previously computed KAs with their metadata; it cycles through observe change, assess and measure, estimate, and enact; it outputs prioritised KAs, cost estimates, and a reproducibility assessment]
KA: Knowledge Asset
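The cycle above can be read as a small decision procedure over knowledge assets. The sketch below is illustrative only: the KnowledgeAsset class, the recomp_step function, and the diff / impact / benefit / refresh_cost / rerun callbacks are assumptions, not part of any ReComp implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List

@dataclass
class KnowledgeAsset:
    """A previously computed outcome plus the metadata ReComp needs."""
    outcome: Any        # e.g. a patient diagnosis, a trained model
    provenance: set     # identifiers of the data the outcome depends on
    cost: float         # cost of the original computation
    meta: dict = field(default_factory=dict)

def recomp_step(assets: Iterable[KnowledgeAsset],
                change_event: Any,
                diff: Callable[[Any], Any],
                impact: Callable[[KnowledgeAsset, Any], float],
                benefit: Callable[[KnowledgeAsset, Any], float],
                refresh_cost: Callable[[KnowledgeAsset], float],
                rerun: Callable[[KnowledgeAsset], KnowledgeAsset]) -> List[KnowledgeAsset]:
    """One observe / assess / estimate / enact cycle."""
    delta = diff(change_event)                                   # observe change
    decayed = [ka for ka in assets if impact(ka, delta) > 0]     # assess knowledge decay
    worthwhile = [ka for ka in decayed
                  if benefit(ka, delta) > refresh_cost(ka)]      # estimate cost vs benefit
    worthwhile.sort(key=lambda ka: benefit(ka, delta) - refresh_cost(ka), reverse=True)
    return [rerun(ka) for ka in worthwhile]                      # enact: selective refresh
```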
ReComp scenarios
(table: ReComp scenario | target impact areas | why ReComp is relevant | proof-of-concept experiment | expected optimisation)

Dataflow, experimental science
• Target impact areas: genomics
• Why ReComp is relevant: rapid knowledge advances; rapid scaling up of genetic testing at population level
• Proof-of-concept experiment: WES/SVI pipeline, workflow implementation (eScience Central)
• Expected optimisation: timeliness and accuracy of patient diagnosis subject to budget constraints

Time series analysis
• Target impact areas: personal health monitoring; smart city analytics
• Why ReComp is relevant: IoT data streams; rapid data drift; cost of computation at the network edge (e.g. IoT)
• Proof-of-concept experiment: NYC taxi rides challenge (DEBS'15)
• Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low

Data layer optimisation
• Target impact areas: tuning of a large-scale data management stack
• Why ReComp is relevant: optimal data organisation is sensitive to current data profiles
• Proof-of-concept experiment: graph DB re-partitioning
• Expected optimisation: system throughput vs cost of re-tuning

Model learning
• Target impact areas: applications of predictive analytics
• Why ReComp is relevant: predictive models are very sensitive to data drift
• Proof-of-concept experiment: Twitter content analysis
• Expected optimisation: sustained model predictive power over time vs retraining cost

Simulation
• Target impact areas: TBD
• Why ReComp is relevant: repeated simulation is computationally expensive but often not beneficial
• Proof-of-concept experiment: flood modelling / CityCAT, Newcastle
• Expected optimisation: computational resources vs marginal benefit of a new simulation model
Data-intensive systems: two properties
1. Observability (transparency)
How much of a data-intensive system can we observe?
• structure + data flow
2. Control
How much control do we have over the system?
• Execution frequency; total vs. partial re-execution
• Input density
Observability / transparency
(table: rows are system aspects, columns contrast white-box and black-box systems)

Structure (static view)
• White box: dataflow systems (eScience Central, Taverna, VisTrails, ...); scripting (R, Matlab, Python, ...)
• Black box: function semantics; packaged components; third-party services

Data dependencies (runtime view)
• White box: provenance recording of inputs, reference datasets, component versions, and outputs
• Black box: inputs and outputs only; no data dependencies; no details on individual components

Cost
• White box: detailed resource monitoring; cloud → £££
• Black box: wall-clock time; service pricing; setup time (e.g. model learning)
Example: genomics / variant interpretation
What changes:
- Patient variants → improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
White box ReComp
[Diagram: process P with inputs x11, x12, dependencies D11, D12, and output y11]
For each run i:
Observables:
Inputs X = {xi1, xi2, …}
Outputs y = {yi1, yi2, …}
Dependencies Di1, Di2, ...
Provenance prov(y)
Cost(y)
Process structure P = P1.P2…Pk
Measurable changes:
Input diff: δ(xt, xt+1)
Output diff: δ(yt, yt+1)
Dependency diff: δ(Dt, Dt+1)
Control:
Complete / partial rerun
A history of runs
[Diagram: Run 1 (patient A): process P with inputs x11, x12, dependencies D11, DCV, and output y1. Run 2 (patient B): process P with inputs x21, x22, dependencies D21, DCV, and output y2.]
H = {<x, D, P, y, prov(y), cost(y)>}
Note on dependencies Dij:
Fine-grained: the Dij are the results of a query to a dependent data source (e.g. OMIM)
Coarse-grained: only record that (a specific version of) OMIM has been used
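A minimal sketch of how one entry of H might be recorded, under the assumption that provenance and dependencies are stored as plain identifiers. The Run class, its field names, and the OMIM record identifiers are illustrative, not taken from the actual SVI implementation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Run:
    """One element of H = {<x, D, P, y, prov(y), cost(y)>}."""
    x: dict          # inputs, e.g. {"x11": ..., "x12": ...}
    D: dict          # dependencies and their versions, e.g. {"OMIM": "2016-05"}
    P: str           # process (structure / version) identifier
    y: Any           # outcome, e.g. a variant classification
    prov: frozenset  # identifiers of the dependency records actually used for y
    cost: float      # cost of this execution

# Fine-grained dependency record: the individual OMIM entries the run queried
run1 = Run(x={"x11": "...", "x12": "..."}, D={"OMIM": "2016-05"},
           P="SVI-v1", y="diagnosis for patient A",
           prov=frozenset({"omim:100001", "omim:100002"}),  # hypothetical record ids
           cost=120.0)

# Coarse-grained dependency record: only the OMIM release used is remembered
run2 = Run(x={"x21": "...", "x22": "..."}, D={"OMIM": "2016-05"},
           P="SVI-v1", y="diagnosis for patient B",
           prov=frozenset({"OMIM@2016-05"}), cost=115.0)

H = [run1, run2]
```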
ReComp questions
• Scoping:
Which patients (from a large cohort) are going to be affected by
change in input/reference data?
• Impact:
For each patient in scope, how likely is their diagnosis to change?
Approach:
Given Dt+1 and changes δ(Dt, Dt+1):
For each patient X and outcome y, query prov(y) to find references
to Dt.
Patient X is in scope if prov(y) ∩ δ(Dt, Dt+1) is not empty
Missier, Paolo, Jacek Cala, and Eldarina Wijaya. “The Data, They Are a-Changin’.” In
Proc. TAPP’16 (Theory and Practice of Provenance), edited by Sarah Cohen-Boulakia.
Washington D.C., USA: USENIX Association, 2016. https://arxiv.org/abs/1604.06412.
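The scoping rule above ("patient X is in scope if prov(y) ∩ δ(Dt, Dt+1) is not empty") is a plain set intersection over the run history. A sketch, reusing the hypothetical Run / H structures from the previous slide:

```python
def in_scope(history, delta_D):
    """Return the runs whose provenance mentions at least one changed dependency record.

    history: iterable of Run records as sketched above.
    delta_D: δ(Dt, Dt+1) as a set of identifiers of added / removed / updated records.
    """
    delta_D = frozenset(delta_D)
    return [run for run in history if run.prov & delta_D]

# Example: only patient A's run used the changed OMIM entry, so only it needs re-assessing
affected = in_scope(H, {"omim:100001"})   # hypothetical change set
```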
Example: flood modelling in Newcastle
CityCAT (City Catchment Analysis Tool)
Inputs
• Topography (DEMs from LIDAR)
• Physical structures (buildings etc.)
• Land-use data
Outputs
• High-resolution grid of flood depths
Observables:
Inputs X = {xi1, xi2, …}
Outputs y = {yi1, yi2, …}
Cost(y)
Measurable changes:
Input diff: δ(xt, xt+1)
Control:
Simulation rerun
Grid resolution
Regional boundaries
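With only the input diff observable, the simplest control policy is to rerun the simulation only when the inputs have changed enough to matter. The sketch below illustrates that idea and is not part of CityCAT: it assumes each input layer is available as a gridded array and uses arbitrary change thresholds.

```python
import numpy as np

def changed_fraction(old_layer: np.ndarray, new_layer: np.ndarray, tol: float = 0.01) -> float:
    """Fraction of grid cells whose value (e.g. elevation) changed by more than tol."""
    return float(np.mean(np.abs(new_layer - old_layer) > tol))

def should_rerun_simulation(old_inputs: dict, new_inputs: dict, threshold: float = 0.05) -> bool:
    """Trigger a rerun when any input layer (topography, buildings, land use)
    changed over more than `threshold` of its cells."""
    return any(changed_fraction(old_inputs[name], new_inputs[name]) > threshold
               for name in old_inputs)
```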
Example: model learning pattern
Black box ReComp:
Observables:
Outputs y = {yi1, yi2,…}
Cost(y) (retraining)
Measurable changes:
Output quality relative to ground truth: qty(yt)
Control:
Request to retrain
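In this black-box pattern the only signals are the quality of recent outputs against ground truth and the cost of retraining. Below is a sketch of a retrain trigger built on those two signals; the window size and the value placed on a unit of quality are assumptions.

```python
def should_retrain(quality_history, retrain_cost, value_per_quality_point, window=5):
    """Request retraining when the recent quality drop is worth more than retraining costs.

    quality_history: successive qty(y_t) values, e.g. accuracy on newly labelled data.
    """
    if len(quality_history) <= window:
        return False
    baseline = max(quality_history[:-window])      # best quality observed before the window
    current = quality_history[-1]
    expected_gain = (baseline - current) * value_per_quality_point
    return expected_gain > retrain_cost
```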
Black box ReComp – I
• When does the model require retraining?
• What is the expected cost and benefit of re-training the model at
any given time?
The Velox system / Berkeley AMP Lab:
Crankshaw, Daniel, Peter Bailis, Joseph E Gonzalez, Haoyuan Li, Zhao Zhang, Michael J Franklin, Ali
Ghodsi, and Michael I Jordan. “The Missing Piece in Complex Analytics: Low Latency, Scalable
Model Management and Serving with Velox.” In Procs CIDR 2015, Seventh Biennial Conference on
Innovative Data Systems Research, Asilomar, CA, USA, 2015.
http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper19u.pdf.
Example: time series pattern
Black box ReComp – top-k most frequent NYC taxi routes over time
Controlling input data drift:
• How often and how densely should we sample from the stream to keep the output sufficiently current?
Observables:
Outputs y = {yi1, yi2,…}
Cost(y)
Measurable changes:
Output diff: δ(yt, yt+1)
Control:
Sample frequency / sample density
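One way to exercise that control: compare consecutive top-k results, sample less while they stay stable and more when they drift. The sketch below is illustrative; the drift measure (set overlap between consecutive top-k lists) and the halving/doubling policy are assumptions.

```python
def topk_drift(prev_topk, curr_topk):
    """δ(y_t, y_t+1) as the fraction of top-k routes that changed between two windows."""
    prev, curr = set(prev_topk), set(curr_topk)
    if not curr:
        return 0.0
    return 1.0 - len(prev & curr) / len(curr)

def adapt_sampling_rate(rate, prev_topk, curr_topk,
                        low=0.05, high=0.25, min_rate=0.01, max_rate=1.0):
    """Sample sparsely while the outcome is predictable, densely when it drifts."""
    drift = topk_drift(prev_topk, curr_topk)
    if drift < low:
        return max(min_rate, rate / 2)     # stable output: save power on the edge device
    if drift > high:
        return min(max_rate, rate * 2)     # rapid drift: sample more densely
    return rate
```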
Example: graph repartitioning
• The Taper partitioner optimises for a given query workload
• The performance of a partitioning is well defined: the number of inter-partition traversals (#ipt)
• Performance degrades when the query workload changes
[Chart: inter-partition traversals (0 to 3,000,000) vs. % workload change (0% to 120%)]
Observables:
Outputs y = {yi1, yi2,…}
Cost(y) (re-partitioning)
Measurable changes:
Output quality: #ipt
Control:
Re-partition requests
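A sketch of the corresponding trigger: watch #ipt per query batch and request re-partitioning once the degradation since the last partitioning is expected to cost more than re-running the partitioner. None of the names or the cost model below come from Taper; they are illustrative assumptions.

```python
def should_repartition(ipt_per_batch, baseline_ipt, cost_per_traversal,
                       repartition_cost, horizon_batches=1000, recent=10):
    """Re-partition when the excess traversal cost expected over the next
    `horizon_batches` query batches exceeds the one-off re-partitioning cost.

    ipt_per_batch: inter-partition traversals observed per recent query batch.
    baseline_ipt:  traversals per batch measured right after the last partitioning.
    """
    if len(ipt_per_batch) < recent:
        return False
    recent_ipt = sum(ipt_per_batch[-recent:]) / recent
    excess = max(0.0, recent_ipt - baseline_ipt)
    return excess * cost_per_traversal * horizon_batches > repartition_cost
```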
A summary of ReComp problems
Observables:
Inputs X = {xi1, xi2, …}
Outputs y = {yi1, yi2,…}
Dependencies Di1, Di2, ...
Provenance prov(y)
Cost(y)
Process structure P = P1.P2…Pk
Measurable changes:
Input diff: δ(xt, xt+1)
Output diff: δ(yt, yt+1)
Dependency diff: δ(Dt, Dt+1)
Quality(y)
Control:
- Data selection
- Partial / complete rerun

(table: problem | requires | sample problems)
Forwards
• Scoping: identify the affected population subset
• Change impact analysis: inputs, dependencies → outputs
• Requires: white box
• Sample problems: dataflow analytics; simulation; many-runs problems
Backwards
• React to output instability / input drift
• Requires: black box
• Sample problems: model learning; time series analytics; data-driven optimisation
The next steps -- challenges
1. Optimisation:
Observables + control → a reactive system
+ cost and utility functions → optimisation problems (see the sketch after this list)
2. Learning from history:
Can we use history to learn estimates of impact without the need for actual re-computation?
3. Software infrastructure and tooling:
ReComp is a metadata management and analytics exercise
4. Reproducibility:
What really happens when I press the “ReComp” button?
5. Impact:
How do we address key impact areas?
- e-health
- Genomics
- Smart city management
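Reading item 1 as a decision rule: for each observed change and each knowledge asset, choose the control action whose estimated utility gain most exceeds its estimated cost, or do nothing if none does. A minimal sketch, assuming illustrative utility_gain and cost estimators:

```python
def best_refresh_action(ka, delta, actions, utility_gain, cost):
    """Pick the control action (e.g. partial rerun, full rerun, retrain) with the
    highest estimated net benefit for knowledge asset `ka` under change `delta`.
    Returns None when no action is worth its cost."""
    best_net, best_action = 0.0, None
    for action in actions:
        net = utility_gain(ka, delta, action) - cost(ka, action)
        if net > best_net:
            best_net, best_action = net, action
    return best_action
```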