ReComp project kickoff presentation 11-03-2016

Paolo's presentation at the ReComp project kickoff meeting, Newcastle, March 2016

Published in: Technology

  1. ReComp kickoff, Newcastle, March 11, 2016. ReComp: preserving the value of big data insights over time. Panta Rhei (Heraclitus, through Plato). Project Kickoff, March 2016. [Image: painting by Johannes Moreelse]
  2. Data to Knowledge. The data-to-knowledge axiom of the Knowledge Economy: what is the Total Cost of Ownership (TCO) of these knowledge assets? [Diagram labels: Big Data; Meta-knowledge; Algorithms; Tools; Middleware; Reference datasets; The Big Analytics Machine; “Valuable Knowledge”]
  3. Learning from data (supervised). [Diagram labels: Meta-knowledge; Training set; Model learning; Classification algorithms; Predictive classifier; Background Knowledge (prior)]
  4. Learning from data (unsupervised). [Diagram labels: Meta-knowledge; Observations; Model learning; Clustering algorithms; Clustering scheme; Background Knowledge]
  5. Stream Analytics. [Diagram labels: Meta-knowledge; Data stream; Time series analysis; Pattern recognition algorithms; Temporal Patterns; Background Knowledge]
  6. The missing element: time. [Diagram labels: Big Data; The Big Analytics Machine; “Valuable Knowledge” in versions V1, V2, V3; Meta-knowledge; Algorithms; Tools; Middleware; Reference datasets; time axes t]
  7. The ReComp decision support system
     • Observe change: in big data; in meta-knowledge
     • Assess and measure: knowledge decay
     • Estimate: cost and benefits of refresh
     • Enact: reproduce (analytics) processes
     [Diagram labels: Big Data; The Big Analytics Machine; “Valuable Knowledge” in versions V1, V2, V3; Meta-knowledge; Algorithms; Tools; Middleware; Reference datasets; time axes t]
  8. ReComp. KA: Knowledge Assets. [Diagram labels: Change Events; Diff(.,.) functions; “business rules”; Prioritised KAs; Cost estimates; Reproducibility assessment; ReComp DSS; Previously computed KAs and their metadata; Observe change; Assess and measure; Estimate; Enact]
  9. ReComp: project objectives
     • Obj 1. To investigate analytics techniques aimed at supporting re-computation decisions.
     • Obj 2. To research techniques for assessing under what conditions it is practically feasible to re-compute an analytical process. Specific target system environments:
       • Python / Jupyter
       • The eScience Central workflow manager
     • Obj 3. To create a decision support system for the selective recomputation of complex data-centric analytical processes and demonstrate its viability on two target case studies:
       • Genomics (human variant analysis)
       • Urban Observatory (flood modelling)
  10. ReComp: Expected outcomes
      Research outcomes: algorithms that operate on metadata to perform:
      • impact analysis
      • cost estimation
      • differential data and change cause analysis of past and new knowledge outcomes
      • estimation of reproducibility effort
      System outcomes:
      • A software framework consisting of domain-independent, reusable components, which implement the metadata infrastructure and the research outcomes
      • A user-facing decision support dashboard
      It must be possible to integrate the framework with domain-specific components, to support specific scenarios, exemplified by our case studies.
  11. ReComp: Target operating region. [Plot labels: axes Rate of change (slow to fast), Volume (low to high), Cost (low to high); highlighted area “ReComp target region”]
  12. Recomputation analysis: abstraction. [Diagram labels: timeline t1, t2, t3; knowledge assets KA1 to KA5, annotated with subsets of a, b, c, d; Change Events with new versions a’, b’, c’]
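
The abstraction on slide 12 can be read as a lookup from change events to the knowledge assets that depend on the changed inputs. The following is a minimal sketch under that reading only. The dependency sets, the `DEPENDENCIES` name and the `recomputation_candidates` function are hypothetical and are not read off the slide.

```python
from typing import Dict, Set

# Hypothetical dependency sets: the inputs each knowledge asset (KA) was
# computed from. Illustrative only; the slide's actual assignment is not
# recoverable from the transcript.
DEPENDENCIES: Dict[str, Set[str]] = {
    "KA1": {"a", "b", "c"},
    "KA2": {"a", "b", "d"},
    "KA3": {"a", "b"},
    "KA4": {"c", "d"},
    "KA5": {"a", "c"},
}


def recomputation_candidates(
    dependencies: Dict[str, Set[str]], changed: Set[str]
) -> Dict[str, Set[str]]:
    """Map each affected KA to the changed inputs it depends on."""
    affected = {}
    for ka, inputs in dependencies.items():
        hit = inputs & changed  # changed inputs this KA actually uses
        if hit:
            affected[ka] = hit
    return affected


# A change event that replaces b and c with new versions b' and c'.
candidates = recomputation_candidates(DEPENDENCIES, {"b", "c"})
for ka in sorted(candidates):
    print(ka, "depends on changed inputs", sorted(candidates[ka]))
```

A KA that depends on none of the changed inputs never appears in the result, so it is not a candidate for recomputation.
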
  13. Recomputation analysis through sampling. [Diagram labels: Change Events; Monitor; identify recomp candidates; prioritisation (budget, utility); assess effects of change; estimate recomp cost; assess reproducibility cost; sampling; small-scale recomp; large-scale recomp; Meta-K]
  14. Recomputation analysis through modelling
      Change impact model: Δ(x,x’) → Δ(y,y’). Challenging!
      Can we do better?
      • Batching given an allocation of resources?
      • Consolidating jobs with different resource requirements to optimise resource allocation
      [Diagram labels: Change Events; identify recomp candidates; estimate change impact; estimate reproducibility cost/effort; prioritisation (target population, utility, budget); large-scale recomp; Change Impact Model; Cost Model; Model updates]
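
The prioritisation step takes utility and budget as inputs. One simple way to combine them is a greedy ranking by utility per unit of cost, sketched below. This is an assumption for illustration, not the project's method: the `Candidate` fields, the numbers and the greedy heuristic are all hypothetical, and a greedy choice is not guaranteed to be optimal.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    utility: float  # assumed: estimated benefit of refreshing this KA
    cost: float     # assumed: estimated recomputation cost, must be > 0


def prioritise(candidates: List[Candidate], budget: float) -> List[Candidate]:
    """Greedily select candidates by utility/cost ratio within the budget."""
    ranked = sorted(candidates, key=lambda c: c.utility / c.cost, reverse=True)
    selected = []
    remaining = budget
    for c in ranked:
        if c.cost <= remaining:  # skip anything that no longer fits
            selected.append(c)
            remaining -= c.cost
    return selected


# Hypothetical estimates, as a cost model and an impact model might supply.
pool = [
    Candidate("KA1", utility=8.0, cost=4.0),
    Candidate("KA2", utility=3.0, cost=1.0),
    Candidate("KA3", utility=5.0, cost=5.0),
    Candidate("KA4", utility=2.0, cost=2.0),
]
chosen = prioritise(pool, budget=6.0)
print([c.name for c in chosen])
```
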
  15. Metadata + Analytics
      The knowledge is in the metadata!
      Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations.
      Meta-K:
      • Logs
      • Provenance
      • Dependencies
      [Diagram labels: Change Events; identify recomp candidates; estimate change impact; estimate reproducibility cost/effort; large-scale recomp; Change Impact Model; Cost Model; Model updates]
  16. High level architecture. [Diagram labels: ReComp decision dashboard; Execute; Curate; Select/prioritise; prospective provenance curation (Yworkflow); Meta-Knowledge Repository; Research Objects; Change Impact Analysis; Cost Estimation; Differential Analysis; Reproducibility Assessment; domain knowledge (utility functions, priorities policies, data similarity functions); runtime monitor; Logging; Runtime Provenance recorder; Python; (other analytics environments); WP1; metadata items: provenance, logs, data and process versions, process dependencies]
  17. Available technology components
      • W3C PROV model for describing data dependencies (provenance)
      • DataONE “metacat” for data and metadata management
      • The eScience Central Workflow Management System
        • Natively provenance-aware
      • NoWorkflow: an (experimental) Python provenance recorder
      • Cloud resources:
        • Azure, our own private cloud (CIC)
      [Architecture diagram repeated from slide 16]
  18. Challenge 1: estimating impact and cost
      Change impact model: Δ(x,x’) → Δ(y,y’). Challenging!
      [Diagram labels: estimate change impact; estimate reproducibility cost/effort; prioritisation; target population; large-scale recomp; Change Impact Model; Cost Model; Model updates]
  19. Challenge 2: managing the metadata
      How do we generate / capture / store / index / query across multiple metadata types and formats?
      Relevant metadata:
      • Logs of past executions, automatically collected
      • Provenance traces:
        • Runtime (“retrospective”) provenance: automatically collected data dependency graph captured from the computation
        • Process structure (“prospective provenance”): obtained by manually annotating a script
      • External data and system dependencies, process and data versions, and system requirements
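
A retrospective provenance trace is described above as a data dependency graph. One query such a graph supports is "which data items were derived, directly or indirectly, from a changed item?". The sketch below answers it with a breadth-first traversal over plain (derived, source) pairs, loosely in the spirit of the derivation relation in W3C PROV. The edge list, item names and `downstream` function are hypothetical; no provenance library is used.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple

# Hypothetical trace: each pair reads "derived item was derived from source".
DERIVED_FROM = [
    ("aligned_reads", "input_genome"),
    ("aligned_reads", "ref_genome"),
    ("variants", "aligned_reads"),
    ("annotated_variants", "variants"),
    ("annotated_variants", "variants_db"),
]


def downstream(edges: Iterable[Tuple[str, str]], changed: str) -> Set[str]:
    """Return every item transitively derived from the changed item."""
    children = defaultdict(set)
    for derived, source in edges:
        children[source].add(derived)

    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:  # visit each derived item once
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(downstream(DERIVED_FROM, "ref_genome")))
print(sorted(downstream(DERIVED_FROM, "variants_db")))
```
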
  20. Challenge 3: Reproducibility
      Ex.: workflow to identify mutations in a patient’s genome.
      What happens when any of the dependencies change?
      [Diagram labels: Workflow specification; WF manager; Linux VM cluster on Azure; Analyse Input genome; variants; GATK/Picard/BWA; Workflow Manager (and its own dependencies); Ubuntu on Azure; Dep.; Input genome; config; Ref genome; Variants DBs]
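
A first step towards answering the question on slide 20 is detecting that a dependency has changed at all. The sketch below compares a snapshot recorded with a past execution against the current environment. It assumes dependencies can be summarised as name-to-version mappings; the `changed_dependencies` function and the "v1"/"v2" version labels are placeholders, not real release numbers.

```python
from typing import Dict, Optional, Tuple


def changed_dependencies(
    recorded: Dict[str, str], current: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing dependency to its (recorded, current) versions.

    None marks a dependency that is missing from one of the snapshots.
    """
    changes = {}
    for name in sorted(set(recorded) | set(current)):
        old, new = recorded.get(name), current.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Placeholder snapshots; names follow the slide, versions are made up.
recorded = {"GATK": "v1", "BWA": "v1", "ref_genome": "v1", "variants_db": "v1"}
current = {"GATK": "v2", "BWA": "v1", "ref_genome": "v2"}

for name, (old, new) in changed_dependencies(recorded, current).items():
    print(f"{name}: {old} -> {new}")
```

Detecting the change says nothing yet about whether the workflow still runs or whether its output differs. That is the reproducibility assessment the slide asks about.
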
  21. Challenge 4: reusability of the solution across cases
      • How do we make case-specific solutions generic?
      • How do we make the DSS reusable?
      • Refactor: generic framework + case-specific components
      • This is hard: most elements are case-specific!
        • Metadata formats
        • Metadata capture
        • Change impact
        • Cost models
        • Utility functions
        • …
