
Your data won’t stay smart forever: exploring the temporal dimension of (big data) analytics


Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insights, e.g. into genetic diseases, and how much does that cost when we scale re-analysis to an entire population?
The "total cost of ownership” of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and assess the cost and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by giving a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios in which we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.


  1. Your data won’t stay smart forever: exploring the temporal dimension of (big data) analytics. Paolo Missier, Jacek Cala, Manisha Rathi. Scalable Computing Group Seminar, Newcastle, May 2016. (*) Painting by Johannes Moreelse; "Panta Rhei" (Heraclitus).
  2. Data to knowledge. The data-to-knowledge axiom of the Knowledge Economy: Big Data → The Big Analytics Machine (algorithms, tools, middleware, reference datasets: the meta-knowledge) → "Valuable Knowledge".
  3. The missing element: time. Big Data, the meta-knowledge (algorithms, tools, middleware, reference datasets), and the resulting "Valuable Knowledge" all evolve over time t, through successive versions V1, V2, V3, …
  4. The ReComp decision support system
     • Observe change: in big data, in meta-knowledge
     • Assess and measure: knowledge decay
     • Estimate: cost and benefits of refresh
     • Enact: reproduce (analytics) processes
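A minimal sketch of this observe–assess–estimate–enact loop in Python, purely illustrative: the four phases come from the slide, while every interface here (ChangeEvent, the diff, impact, cost, and benefit functions, and ka.recompute()) is a hypothetical placeholder, not part of ReComp.

```python
# Hypothetical sketch of the ReComp decision loop (slide 4).
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    source: str       # e.g. "ClinVar", "input data", "tool version"
    old_version: str
    new_version: str

def recomp_loop(change: ChangeEvent, knowledge_assets, diff, impact, cost, benefit):
    """Observe -> Assess -> Estimate -> Enact, applied to one change event."""
    delta = diff(change.old_version, change.new_version)  # Observe change
    for ka in knowledge_assets:
        decay = impact(ka, delta)                         # Assess knowledge decay
        if benefit(ka, decay) > cost(ka):                 # Estimate cost vs benefit
            ka.recompute()                                # Enact: refresh the asset
```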
  5. The ReComp Decision Support System (KA: knowledge asset)
     • Inputs: change events, diff(.,.) functions, "business rules", and the previously computed KAs with their metadata
     • Outputs: prioritised KAs, cost estimates, reproducibility assessments
     • Phases: observe change, assess and measure, estimate, enact
  6. ReComp scenarios (scenario | target impact areas | why is ReComp relevant? | proof-of-concept experiments | expected optimisation)
     • Dataflow, experimental science | genomics | rapid knowledge advances; rapid scaling-up of genetic testing at population level | WES/SVI pipeline, workflow implementation (eScience Central) | timeliness and accuracy of patient diagnosis, subject to budget constraints
     • Time series analysis | personal health monitoring, smart city analytics, IoT data streams | rapid data drift; cost of computation at the network edge (e.g. IoT) | NYC taxi rides challenge (DEBS’15) | use of low-power edge devices when the outcome is predictable and data drift is low
     • Data layer optimisation | tuning of a large-scale data management stack | optimal data organisation is sensitive to current data profiles | graph DB re-partitioning | system throughput vs cost of re-tuning
     • Model learning | applications of predictive analytics | predictive models are very sensitive to data drift | Twitter content analysis | sustained model predictive power over time vs retraining cost
     • Simulation | TBD | repeated simulation is computationally expensive but often not beneficial | flood modelling with CityCAT in Newcastle | computational resources vs marginal benefit of a new simulation model
  7. Data-intensive systems: two properties
     1. Observability (transparency): how much of a data-intensive system can we observe? Structure + data flow.
     2. Control: how much control do we have over the system? Execution frequency (total or partial reruns), input density.
  8. Observability / transparency (white box vs black box)
     • Structure (static view). White box: dataflow systems (eScience Central, Taverna, VisTrails, …), scripting (R, Matlab, Python, …). Black box: function semantics, packaged components, third-party services.
     • Data dependencies (runtime view). White box: provenance recording of inputs, reference datasets, component versions, and outputs. Black box: inputs and outputs only; no data dependencies, no details on individual components.
     • Cost. White box: detailed resource monitoring (cloud → £££). Black box: wall-clock time, service pricing, setup time (e.g. model learning).
  9. Example: genomics / variant interpretation. What changes:
     • Patient variants, as sequencing and variant calling improve
     • ClinVar and OMIM evolve rapidly
     • New reference data sources appear
  10. White box ReComp. For each run i (e.g. inputs x11, x12 through process P, with dependencies D11, D12, producing output y11):
     • Observables: inputs X = {xi1, xi2, …}; outputs Y = {yi1, yi2, …}; dependencies D11, D12, …; provenance prov(y); cost(y); process structure P = P1.P2…Pk
     • Measurable changes: input diff δ(xt, xt+1); output diff δ(yt, yt+1); dependency diff δ(Dt, Dt+1)
     • Control: complete / partial rerun
  11. A history of runs: H = {<x, D, P, y, prov(y), cost(y)>}, one record per run (e.g. Run 1, Patient A: x11, x22, D11, DCV → y1; Run 2, Patient B: x21, x22, D21, DCV → y2).
     Note on the dependencies Dij:
     • Fine-grained: the Dij are the results of a query to a dependent data source (e.g. OMIM)
     • Coarse-grained: only record that (a specific version of) OMIM has been used
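A sketch of one record of H as a Python data structure; the six components <x, D, P, y, prov(y), cost(y)> come from the slide, while the field names and types are illustrative assumptions.

```python
# One record of the run history H = {<x, D, P, y, prov(y), cost(y)>} (slide 11).
from dataclasses import dataclass, field

@dataclass
class Run:
    inputs: dict        # x: e.g. {"x11": variants_file, "x12": phenotype}
    dependencies: dict  # D: fine-grained (query results from OMIM/ClinVar)
                        #    or coarse-grained (e.g. {"OMIM": "v2016-05"})
    process: str        # P: identifier/version of the pipeline P1.P2...Pk
    output: object      # y: e.g. the patient's diagnosis
    provenance: set = field(default_factory=set)  # prov(y): ids of records y used
    cost: float = 0.0   # cost(y): e.g. CPU hours or cloud spend

history: list[Run] = []  # H: one record per past execution
```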
  12. ReComp questions
     • Scoping: which patients (from a large cohort) are going to be affected by a change in input / reference data?
     • Impact: for each patient in scope, how likely is that patient’s diagnosis to change?
     Approach: given Dt+1 and the changes δ(Dt, Dt+1), for each patient X and outcome y, query prov(y) to find references to Dt. Patient X is in scope if prov(y) ∩ δ(Dt, Dt+1) is not empty.
     Missier, Paolo, Jacek Cala, and Eldarina Wijaya. “The Data, They Are a-Changin’.” In Procs. TAPP’16 (Theory and Practice of Provenance), edited by Sarah Cohen-Boulakia. Washington D.C., USA: USENIX Association, 2016. https://arxiv.org/abs/1604.06412
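The scoping test above translates almost directly into code. A sketch, assuming (as in the fine-grained case of slide 11) that both prov(y) and the versions Dt, Dt+1 are sets of record identifiers from the dependent source:

```python
# Scoping, per slide 12: patient X is in scope iff prov(y) ∩ δ(Dt, Dt+1) ≠ ∅.
def diff(d_t: set, d_t1: set) -> set:
    """δ(Dt, Dt+1): ids added or removed between versions (a real diff
    would also compare record values to catch in-place updates)."""
    return d_t ^ d_t1

def in_scope(run, delta: set) -> bool:
    """True iff the run's output referenced any changed record."""
    return bool(run.provenance & delta)

# Usage, given the history H from the previous sketch and two OMIM versions:
#   affected = [r for r in history if in_scope(r, diff(omim_t, omim_t1))]
```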
  13. Example: flood modelling in Newcastle. Control: simulation rerun. CityCAT (City Catchment Analysis Tool).
     • Inputs: topography (DEMs from LIDAR), physical structures (buildings etc.), land-use data
     • Outputs: high-resolution grid of flood depths
     • Observables: inputs X = {xi1, xi2, …}; outputs Y = {yi1, yi2, …}; cost(y)
     • Measurable changes: input diff δ(xt, xt+1)
     • Control: grid resolution, regional boundaries
  14. Example: model learning pattern (black box ReComp)
     • Observables: outputs Y = {yi1, yi2, …}; cost(y) (retraining)
     • Measurable changes: output quality relative to ground truth, qty(yt)
     • Control: request to retrain
  15. Black box ReComp – I
     • When does the model require retraining?
     • What is the expected cost and benefit of retraining the model at any given time?
     See the Velox system (Berkeley AMP Lab): Crankshaw, Daniel, Peter Bailis, Joseph E. Gonzalez, Haoyuan Li, Zhao Zhang, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox.” In Procs. CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA. http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper19u.pdf
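One way to make these two questions concrete, as a sketch: compare the value of the quality lost to drift against the known retraining cost. The threshold logic, window size, and value-per-quality-point are assumptions, taken neither from the slides nor from Velox.

```python
# Hypothetical retraining trigger for the black-box model learning pattern:
# the observable is output quality qty(y_t) against ground truth, the only
# control is a retrain request (slide 14).
def should_retrain(quality_history, retrain_cost, value_per_quality_point,
                   window=10, baseline=0.9):
    """Retrain when the expected value of recovering the quality lost to
    data drift exceeds the cost of retraining."""
    recent = quality_history[-window:]            # assumes a non-empty history
    decay = baseline - sum(recent) / len(recent)  # average recent quality drop
    expected_benefit = max(decay, 0.0) * value_per_quality_point
    return expected_benefit > retrain_cost
```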
  16. Example: time series pattern. Black box ReComp – top-k most frequent NYC taxi routes over time.
     • Control question, given input data drift: how often and how densely should we sample from the stream to keep the output sufficiently current?
     • Observables: outputs Y = {yi1, yi2, …}; cost(y)
     • Measurable changes: output diff δ(yt, yt+1)
     • Control: sample frequency / sample density
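A sketch of one possible sampling controller for this pattern: measure δ(yt, yt+1) as the turnover between consecutive top-k lists and adapt the recomputation interval accordingly. The adaptation rule and all constants are illustrative assumptions.

```python
# Adaptive sampling for the top-k time series pattern (slide 16).
def next_interval(y_prev: list, y_curr: list, interval: float,
                  lo: float = 0.1, hi: float = 0.5,
                  min_i: float = 60, max_i: float = 3600) -> float:
    """δ(yt, yt+1) as top-k turnover in [0, 1]; widen the sampling interval
    when the output is stable, shrink it when the ranking drifts.
    Assumes both lists hold the same number k > 0 of routes."""
    turnover = len(set(y_prev) ^ set(y_curr)) / (2 * len(y_curr))
    if turnover < lo:
        return min(interval * 2, max_i)  # stable: recompute less often
    if turnover > hi:
        return max(interval / 2, min_i)  # drifting: recompute more often
    return interval
```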
  17. Example: graph repartitioning
     • The Taper partitioner optimises for a given query workload
     • The performance of a partitioning is well defined: # inter-partition traversals
     • Performance degrades when the query workload changes
     [Chart: inter-partition traversals vs. % workload change]
     • Observables: outputs Y = {yi1, yi2, …}; cost(y) (re-partition)
     • Measurable changes: output quality, #ipt
     • Control: re-partition requests
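In the same spirit, a hypothetical re-partitioning trigger: weigh the traversal overhead that re-tuning would remove against the one-off cost of running the partitioner. The cost model, the planning horizon, and the estimate of post-tuning traversals are all assumptions.

```python
# Re-partitioning trigger for the graph DB pattern (slide 17): output quality
# is #ipt (inter-partition traversals), control is a re-partition request.
def should_repartition(ipt_per_period: int, est_ipt_after_tuning: int,
                       cost_per_traversal: float, repartition_cost: float,
                       horizon_periods: int = 10) -> bool:
    """Re-run the partitioner when the traversal savings expected over the
    planning horizon outweigh the one-off cost of re-tuning."""
    saving = (ipt_per_period - est_ipt_after_tuning) * cost_per_traversal
    return saving * horizon_periods > repartition_cost
```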
  18. A summary of ReComp problems
     • Observables: inputs X = {xi1, xi2, …}; outputs Y = {yi1, yi2, …}; dependencies D11, D12, …; provenance prov(y); cost(y); process structure P = P1.P2…Pk
     • Measurable changes: input diff δ(xt, xt+1); output diff δ(yt, yt+1); dependency diff δ(Dt, Dt+1); quality(y)
     • Control: data selection; partial / complete rerun
     Problems (problem | requires | sample problems):
     • Forwards – scoping (identify the affected population subset) and change impact analysis (inputs, dependencies → outputs) | white box | dataflow analytics, simulation, many-runs problems
     • Backwards – react to output instability / input drift | black box | model learning, time series analytics, data-driven optimisation
  19. The next steps – challenges
     1. Optimisation: observables + control → a reactive system; adding cost and utility functions → optimisation problems
     2. Learning from history: can we use history to learn estimates of impact without the need for actual re-computation?
     3. Software infrastructure and tooling: ReComp is a metadata management and analytics exercise
     4. Reproducibility: what really happens when I press the “ReComp” button?
     5. Impact: how do we address the key impact areas of e-health, genomics, and smart city management?
