
The data, they are a-changin’


Talk given at TAPP'16 (Theory and Practice of Provenance), June 2016. The paper is available here:
https://arxiv.org/abs/1604.06412
Abstract:
The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors:
low cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms.
One observation that is often overlooked, however, is that none of these elements is immutable; rather, they all evolve over time.
As those datasets change over time, the value of their derivative knowledge may decay, unless it is preserved by reacting to those changes. Our broad research goal is to develop models, methods, and tools for selectively reacting to changes by balancing costs and benefits, i.e. through complete or partial re-computation of some of the underlying processes.
In this paper we present an initial model for reasoning about change and re-computations, and show how analysis of detailed provenance of derived knowledge informs re-computation decisions.
We illustrate the main ideas through a real-world case study in genomics, namely on the interpretation of human variants in support of genetic diagnosis.


  1. The data, they are a-changin’ (ReComp: Your Data Will Not Stay Smart Forever). Paolo Missier, Jacek Cala, Eldarina Wijaya, School of Computing Science, Newcastle University, {firstname.lastname}@ncl.ac.uk. TAPP’16, McLean, VA, USA, June 2016. (*) Painting by Johannes Moreelse. “Panta Rhei” (Heraclitus, through Plato).
  2. Data to Knowledge. Lots of Data → The Big Analytics Machine → “Valuable Knowledge”. The machine draws on meta-knowledge: algorithms, tools, middleware, reference datasets.
  3. The missing element: time. Lots of Data → The Big Analytics Machine → “Valuable Knowledge” (versions V1, V2, V3). Both the data and the meta-knowledge (algorithms, tools, middleware, reference datasets) evolve over time t. Your data will not stay smart forever.
  4. ReComp. Observe change: in input data, in meta-knowledge. Assess and measure: knowledge decay. Estimate: cost and benefits of refresh. Enact: reproduce the (analytics) processes. A sketch of this loop follows below.
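The four phases above form a control loop over the run history. A minimal Python sketch of that loop follows; every name and heuristic in it (ChangeEvent, assess_decay, the cost/benefit fields) is an illustrative assumption, not part of ReComp:

```python
# Hypothetical sketch of the ReComp observe / assess / estimate / enact loop.
# All names and heuristics here are illustrative assumptions, not ReComp's API.
class ChangeEvent:
    def __init__(self, resource, old_version, new_version):
        self.resource = resource
        self.old_version = old_version
        self.new_version = new_version

def assess_decay(run, event):
    # Stand-in decay measure: a run decays only if it used the changed resource.
    return 1.0 if event.resource in run["deps"] else 0.0

def recomp(history, events, budget):
    decisions = []
    for event in events:                            # observe change
        for run in history:
            if assess_decay(run, event) == 0.0:     # assess and measure decay
                continue
            cost, benefit = run["cost"], run["benefit"]   # estimate refresh
            if benefit > cost and cost <= budget:   # enact, selectively
                decisions.append(run["id"])
                budget -= cost
    return decisions

# Example: one ClinVar update, two past runs, a limited re-computation budget.
history = [
    {"id": "run-1", "deps": {"ClinVar"}, "cost": 10, "benefit": 30},
    {"id": "run-2", "deps": {"OMIM"},    "cost": 10, "benefit": 30},
]
print(recomp(history, [ChangeEvent("ClinVar", "2016-04", "2016-05")], budget=15))
# -> ['run-1']
```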
  5. The ReComp decision support system. Inputs: change events, diff(.,.) functions, utility functions, impact estimation, cost estimates, reproducibility assessment. The DSS is backed by a history of knowledge assets and their metadata, and produces re-computation recommendations.
  6. ReComp concerns. 1. Observability (transparency): how much can we observe? Structure; data flow. 2. Change detection on inputs, outputs, and external resources: can we quantify the extent of changes? → diff() functions. 3. Impact assessment (informed by provenance): can we quantify knowledge decay? Scope: which instances? Frequency: how often? Re-run extent: how much? 4. Control, i.e. reaction to changes: how much re-computation control do we have over the system? Reproducibility: virtualisation, smart re-run.
  7. Observability / transparency. White box: structure (static view) as dataflow (eScience Central, Taverna, VisTrails, …); data dependencies (runtime view) via provenance recording of inputs, reference datasets, component versions, and outputs; cost via detailed resource monitoring (cloud → £££). Black box: scripting (R, Matlab, Python, …), packaged components, third-party services; only inputs and outputs are visible, with no data dependencies and no details on individual components; cost via wall-clock time, service pricing, setup time (e.g. model learning). This talk: white-box ReComp, initial experiments.
  8. Example: genomics / variant interpretation. SVI is a classifier of likely variant deleteriousness: y = {(v, class) | v ∈ varset, class ∈ {red, amber, green}}, where red = definitely deleterious, amber = uncertain diagnosis, green = definitely benign. A toy rendering of this mapping follows below.
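As a toy rendering of that mapping, the following snippet classifies variants into the three traffic-light classes; the rule bodies are invented for illustration and are not SVI's actual logic (see [5] for the real tool):

```python
# Toy illustration of the mapping y = {(v, class)}; the rules below are
# invented placeholders, not SVI's actual classification logic (see [5]).
def classify(clinvar_status, in_phenotype_genes):
    if clinvar_status == "pathogenic" and in_phenotype_genes:
        return "red"     # definitely deleterious
    if clinvar_status == "benign":
        return "green"   # definitely benign
    return "amber"       # uncertain diagnosis

varset = {"v1": ("pathogenic", True), "v2": ("benign", False), "v3": ("unknown", True)}
y = {(v, classify(*obs)) for v, obs in varset.items()}
print(sorted(y))  # [('v1', 'red'), ('v2', 'green'), ('v3', 'amber')]
```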
  9. OMIM and ClinVar changes. Sources of changes: patient variants (improved sequencing / variant calling); ClinVar and OMIM evolve rapidly; new reference data sources appear. [Chart: ClinVar / OMIM changes relevant to a patient cohort over time (Newcastle Institute of Genetic Medicine).]
  10. White-box ReComp. For each run i, the observables are: inputs X_i = {x_i1, x_i2, …}; outputs Y_i = {y_i1, y_i2, …}; dependencies D_i1, D_i2, …; variable-granularity provenance prov(y); granular cost(y) down to single-block level; and the granular process structure P as a workflow graph.
  11. White-box provenance. [Diagrams contrasting a coarse-grained and a granular provenance view of a run y11 = P(x11, x12) with dependencies D11, D12.]
  12. A history of runs. Run 1, patient A: y1 = P(x11, x12) with dependencies D11, D_CV. Run 2, patient B: y2 = P(x21, x22) with dependencies D21, D_CV. All runs are stored in a history database H; one plausible record shape is sketched below.
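One plausible shape for a record in H, sketched in Python; all field names here are assumptions for illustration, not ReComp's actual schema:

```python
# A hypothetical record shape for the history database H; field names are
# illustrative assumptions, not ReComp's actual schema.
run1 = {
    "id": "run-1", "patient": "A", "process": "P",
    "inputs": {"x11": "...", "x12": "..."},
    "deps": {"D11": "v1", "D_CV": "ClinVar-2016-05"},  # D_CV: shared ClinVar dep
    "outputs": {"y1": "..."},
    # Granular provenance as derivation triples (coarse runs would omit these)
    "prov": [("y1", "wasDerivedFrom", "x11"),
             ("y1", "wasDerivedFrom", "x12"),
             ("y1", "wasDerivedFrom", "D_CV")],
    "cost": {"wallclock_s": 900},
}
H = [run1]  # plus a run-2 record for patient B, and so on
```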
  13. ReComp questions. • Scope: which instances? Which patients within the cohort are going to be affected by a change in input/reference data? • Re-run extent: how much? Where in each process instance is the reference data used? • Impact: why bother? For each patient in scope, how likely is it that the patient’s diagnosis will change? • Frequency: how often? How often are updates available for the resources we depend on?
  14. Available metadata. 1. The history DB. 2. Measurable changes: input diff (one patient at a time); output diff (has the change had any impact?); dependency diff (affects the entire cohort → scoping). Sketches of such diff() functions follow below.
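As a hedged sketch, simple set-based diff() functions over these observables might look as follows; real diff functions would be type-specific (e.g. variant-level for inputs, record-level for ClinVar/OMIM dependencies):

```python
# Set-based diff() sketches; real diff functions would be type-specific
# (variant-level for inputs, record-level for ClinVar/OMIM dependencies).
def diff_sets(old, new):
    old, new = set(old), set(new)
    return {"added": new - old, "removed": old - new}

def output_diff(y_old, y_new):
    # The output diff doubles as the impact test: a non-empty diff means
    # the change had some effect on this patient's result.
    d = diff_sets(y_old.items(), y_new.items())
    return d, bool(d["added"] or d["removed"])

print(output_diff({"v1": "red"}, {"v1": "amber"}))
```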
  15. Querying provenance to determine scope and re-run extent. Case 1: granular provenance. Given observed changes in resources and a history instance h(y) ∈ H: 1. Scoping: y is in the scope S ⊆ H if its provenance records a dependency on a changed resource; each such process block Pj is added to Pscope(y). 2. Re-run extent: find a partial order on Pscope(y); re-run starts from each of the earliest Pj whose inputs are available as persistent intermediate results (see for instance the smart re-run manager in Kepler [1]). A scoping sketch follows below.
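With granular provenance stored as derivation triples, scoping reduces to a lookup and the re-run frontier to a check over the partial order. A minimal sketch, assuming the hypothetical record shape above (these are not the actual ReComp queries):

```python
# Scoping and re-run extent over granular provenance; assumes the hypothetical
# record shape sketched earlier, not ReComp's actual queries.
def scope(H, changed):
    # A run is in scope if its provenance derives anything from the changed resource.
    return [run for run in H
            if any(src == changed for (_t, rel, src) in run["prov"]
                   if rel == "wasDerivedFrom")]

def rerun_frontier(affected, succ):
    # Earliest affected blocks in the partial order: those with no affected
    # predecessor. Re-run starts there, reusing persisted intermediates upstream.
    affected = set(affected)
    preds = {b: {p for p, outs in succ.items() if b in outs} for b in affected}
    return [b for b in affected if not (preds[b] & affected)]

# Example: P1 -> P2 -> P3; P2 and P3 are affected, so re-run starts at P2.
print(rerun_frontier(["P2", "P3"], {"P1": ["P2"], "P2": ["P3"]}))  # ['P2']
```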
  16. Querying provenance to determine scope and re-run extent. Case 2: coarse-grained provenance. Scoping: any instance that depends on any changed Dij is in scope: Pscope = {Pj | Pj depends on some changed Dij}. Re-run extent: the mechanism from the fine-grained case still works. This is trivial for a homogeneous run population, but H may contain run history for many different workflows! A coarse-grained scoping sketch follows below.
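Under coarse-grained provenance, scoping collapses to a dependency-set lookup; a sketch over the same assumed record shape, with an optional workflow filter for the heterogeneous-history case:

```python
# Coarse-grained scoping: no per-block provenance, so any run whose declared
# dependency set mentions the changed resource is in scope. H may mix runs of
# many different workflows, hence the optional workflow filter.
def coarse_scope(H, changed, workflow=None):
    return [run for run in H
            if changed in run["deps"]
            and (workflow is None or run["process"] == workflow)]
```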
  17. Assessing impact and cost. Approach: small-scale re-computation over the population in scope. 1. Sample instances S′ ⊆ S from the population in scope S. 2. Perform a partial re-run on each instance h(y_i, v) ∈ S′, generating new outputs y_i′. 3. Compute the output differences diff(y_i, y_i′). 4. Assess impact (user-defined) and cost(y_i′). 5. Estimate the cost difference diff(cost(y_i), cost(y_i′)). The sketch below walks through these steps.
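The five steps translate into a short sampling procedure; a sketch under the same assumed record shape, where partial_rerun(), impact(), and diff() are user-supplied stand-ins:

```python
import random

# Steps 1-5 as a sampling procedure over the scope S. partial_rerun(), impact()
# and diff() are user-supplied stand-ins, as on the slide; field names follow
# the hypothetical record shape used above.
def estimate_impact_and_cost(S, partial_rerun, impact, diff, sample_size=5):
    S_prime = random.sample(S, min(sample_size, len(S)))         # 1. sample
    reports = []
    for run in S_prime:
        y_new, cost_new = partial_rerun(run)                     # 2. partial re-run
        d = diff(run["outputs"], y_new)                          # 3. output diff
        reports.append({
            "run": run["id"],
            "impact": impact(d),                                 # 4. user-defined impact
            "cost_delta": cost_new - run["cost"]["wallclock_s"]  # 5. cost difference
        })
    return reports
```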
  18. ReComp user dashboard and architecture. ReComp is a decision support system: impact and cost assessments feed a decision dashboard through which users execute, curate, and select/prioritise re-computations. Components: prospective provenance curation (YesWorkflow); a meta-knowledge repository of research objects (provenance, logs, data and process versions, process dependencies); change impact analysis; cost estimation; differential analysis; reproducibility assessment; domain knowledge in the form of utility functions, priority policies, and data similarity functions; and runtime monitors with logging and runtime provenance recorders, for Python and other analytics environments.
  19. Current status and challenges. Implementation in progress. Small-scale experiments on scoping / partial re-run: a test cohort of about 50 (real) patients; short workflow runs (about 15 minutes) with observable cost savings (preliminary results). Main challenge: delivering a generic and reusable DSS. From eScience Central to generic dataflow and scripting (Python); from eSc provenance traces (PROV-compliant but idiosyncratic patterns) and Python (noWorkflow traces) to canonical PROV patterns, queries, and an implementation of the history database H. ReComp: http://recomp.org.uk/
  20. References.
[1] Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger-Frank, E., Jones, M., Lee, E., Tao, J., Zhao, Y.: Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, Wiley, 2005.
[2] Ikeda, R., Salihoglu, S., Widom, J.: Provenance-Based Refresh in Data-Oriented Workflows. In: Procs. CIKM, 2011.
[3] Ikeda, R., Widom, J.: Panda: A System for Provenance and Data. In: Procs. TaPP'10, 33:1–8, 2010.
[4] Koop, D., Santos, E., Bauer, B., Troyer, M., Freire, J., Silva, C.T.: Bridging Workflow and Data Provenance Using Strong Links. In: Scientific and Statistical Database Management, pp. 397–415, Springer, 2010.
[5] Missier, P., Wijaya, E., Kirby, R., Keogh, M.: SVI: A Simple Single-Nucleotide Human Variant Interpretation Tool for Clinical Use. In: Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, Springer, 2015.
