
Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes



A talk given at the Universidad La Rioja, Spain -- "Provenance Week".


  1. Paolo Missier and Jacek Cala, Newcastle University, UK. Universidad La Rioja, Spain, Oct 29th, 2019. Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes. In collaboration with the Institute of Genetic Medicine, Newcastle University.
  2. Context. Big Data → the Big Analytics Machine → Actionable Knowledge. In data science over time, everything evolves: input data (versions V1, V2, V3), meta-knowledge, algorithms, tools, libraries, and reference datasets.
  3. What changes? Genomics: reference databases, algorithms and libraries. Simulation: large parameter space, input conditions. Machine learning: evolving ground-truth datasets, model re-training.
  4. Genomics. Image credits: Broad Institute. https://software.broadinstitute.org/gatk/ https://www.genomicsengland.co.uk/the-100000-genomes-project/ Spark GATK tools on Azure: 45 mins/GB; at 13 GB per exome, about 10 hours.
  5. Genomics: WES/WGS, variant calling → variant interpretation. SVI: Simple Variant Interpretation, a simple single-nucleotide human variant interpretation tool for clinical use. Variant classification: pathogenic, benign, and unknown/uncertain. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
  6. Changes that affect variant interpretation. What changes: improved sequencing / variant calling; ClinVar and OMIM evolve rapidly; new reference data sources. Evolution in the number of variants that affect patients: (a) with a specific phenotype, (b) across all phenotypes.
  7. Blind reaction to change: a game of battleship. The sparsity issue: about 500 executions across 33 patients, with a total runtime of about 60 hours, yet only 14 relevant output changes detected; that is 4.2 hours of computation per change. Should we care about updates? Knowledge about gene variations keeps evolving.
  8. ReComp (http://recomp.org.uk/). Outcome: a generic, customisable framework for selective re-computation. Scope: expensive analysis + frequent changes + not all changes significant. Challenge: make re-computation efficient in response to changes. Assumptions: processes are observable and reproducible, and estimates are cheap. Insight: replace re-computation with change impact estimation, using the history of past executions.
  9. Reproducibility: why, when, and to what extent. How to be selective: across a cohort of past executions → which subset of individuals? Within a single re-execution → which process fragments? Example triggers: a change in ClinVar, a change in GeneMap.
  10. The rest of the talk: the approach, exemplified on the SVI workflow, and the architecture.
  11. The ReComp meta-process. Record the execution history of the analytics process P (log / provenance) in a History DB. Detect and quantify changes, diff(d, d'), in reference datasets and inputs. For each past instance, estimate the impact of the changes, Impact(d → d', o), using impact estimation functions; select the relevant sub-processes (scope, optimisation); then partially re-execute: P(D) → P(D').
  12. How much do we know about P? Impact estimation and re-execution depend on how much of the process structure and execution trace is visible. Black box (I/O provenance only; a monolithic or legacy process, e.g. a complex simulator): all-or-nothing re-execution. White box (step-by-step provenance; workflows, R / Python code, e.g. genomics analytics, the typical process): fine-grained impact, partial re-execution via restart trees (*). (*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018. London: Springer; 2018.
  13. ReComp meta-process flow (PaoloMissier2019). (1) Identify the subset of executions that are potentially affected by the changes; (2) determine whether the changes may have had any impact on outputs; (3) identify and re-execute the minimal fragments of workflow that have been affected.
  14. Change Front. Change events arrive over time, e.g. C0 {a1 → a0}, C1 {b1 → b0}, C2 {a2 → a1, c1 → c0}, C3 {a3 → a2, c2 → c1}, C4 {d1 → d0}, C5 {b2 → b1}. The change front after a sequence of events is the set of latest versions, e.g. CF3 = {a3, b1, c2} and CF5 = {a3, b2, c2, d1}. Past executions record the versions they used, e.g. E(…, [a0, b0, e0]), E(…, [a0, b1, d0]), E(…, [a2, b1, c1]).
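The change-front idea above can be sketched in a few lines. This is an illustrative model, not the talk's implementation: each change event is represented as a dict mapping a dependency name to its new version, and the front is simply the latest version seen for each dependency.

```python
# Minimal sketch of a change front: after each change event, the front holds
# the latest known version of every dependency. Event contents and version
# names are illustrative.
def apply_events(events):
    """events: list of dicts mapping a dependency name to its new version."""
    front = {}
    for event in events:
        front.update(event)  # newer versions overwrite older ones
    return front

# Events in the style of the slide: first a changes, then b, then a and c.
events = [{"a": "a1"}, {"b": "b1"}, {"a": "a2", "c": "c1"}]
print(apply_events(events))
```

Comparing the front against the versions recorded for a past execution E(…, [a0, b0, e0]) then tells ReComp which dependencies of that execution are stale.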
  15. Re-computation Front. We use wasInformedBy(..., [prov:type=“recomp:re-execution”]) to denote a ReComp-initiated re-execution.
  16. Re-computation Front
  17. Re-computation Front … user-initiated
  18. Re-computation Front
  19. Restart Tree. The re-computation front handles single executions well. What if the process is more complex than that: a pipeline, a workflow, a complex hierarchical workflow? Cf. the NGS pipeline.
  20. Restart Tree
  21. Restart Tree
  22. Restart Tree. The provenance trace includes multiple interrelated executions. During re-execution we have to combine all of them within a single context: the top-level execution.
  23. Execution trace / Provenance. [ProvONE UML diagram: classes User, Execution, Program, Workflow, Channel, Port, Controller, Entity, Collection, Association, Usage, Generation; relations include wasPartOf, hadMember, wasDerivedFrom, hasSubProgram, hadPlan, controlledBy / controls, hasInPort / hasOutPort, connectsTo, wasAssociatedWith, wasInformedBy, wasGeneratedBy, used, hasDefaultParam.]
  24. ProvONE. [The same ProvONE class diagram, enlarged.]
  25. Restart Tree. To build a restart tree we rely on the provone:wasPartOf statements. CF = {b2, e1}.
  26. Restart Tree. The tree captures the vertical dimension of a single execution: the transitive closure of the wasPartOf relation. CF = {b2, e1}.
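The transitive closure of wasPartOf can be sketched as a short walk up the containment chain. The execution identifiers below are illustrative placeholders, not values from the talk:

```python
# Illustrative sketch: given provone:wasPartOf statements, represented here as
# a child -> parent mapping, the restart tree for an affected execution is
# rooted at its top-level ancestor. Collecting the ancestors is the transitive
# closure of wasPartOf.
def ancestors(exec_id, was_part_of):
    chain = []
    while exec_id in was_part_of:
        exec_id = was_part_of[exec_id]
        chain.append(exec_id)
    return chain  # ordered from immediate parent up to the top-level execution

# Hypothetical nested executions: two blocks inside a workflow, inside a
# top-level pipeline execution.
was_part_of = {"B1exec": "WFexec", "B2exec": "WFexec", "WFexec": "TopExec"}
print(ancestors("B1exec", was_part_of))
```

The last element of the returned chain is the single context (the top-level execution) under which the re-execution is combined.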
  27. ReComp meta-process flow: determine whether the changes may have had any impact on outputs.
  28. Changes, data diff, impact. (1) Observed change events over inputs, dependencies, or both, e.g. C = {d → d'}. (2) Type-specific diff functions, diff(d, d'). (3) The impact of a change C on an output y, imp(C, y). Impact is process- and data-specific.
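The two-step shape of the estimate, a type-specific diff followed by an impact function over the delta and a past output, can be sketched as function composition. The concrete diff and impact functions below are toy stand-ins, not SVI's:

```python
# Hedged sketch of the diff/impact pipeline: compute a type-specific delta
# for the changed resource, then evaluate an impact predicate on the delta
# and a past output. Both concrete functions are illustrative placeholders.
from typing import Any, Callable

def estimate_impact(old: Any, new: Any, output: Any,
                    diff: Callable[[Any, Any], Any],
                    imp: Callable[[Any, Any], bool]) -> bool:
    delta = diff(old, new)
    return imp(delta, output)

# Toy instantiation: delta = symmetric set difference; impact is True when
# the delta overlaps the records the output depends on.
set_diff = lambda a, b: a ^ b
touches = lambda delta, out: bool(delta & out)
print(estimate_impact({1, 2}, {2, 3}, {3, 9}, set_diff, touches))
```

Keeping diff and impact as separate pluggable functions is what lets the framework stay generic while the functions themselves remain process- and data-specific.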
  29. Impact. Given a fixed process P, a change in one of the inputs to P, C = {x → x'}, affects a single output. However, a change in one of the dependencies, C = {d → d'}, affects all outputs yt where version d of D was used.
  30. SVI: data-diff and impact functions are data-specific and process-specific. Diff functions apply to OMIM and ClinVar; the overall impact combines the impact on the ‘p1 select genes’ step and the impact on the SVI output.
  31. Diff functions for SVI. Comparing ClinVar 1/2016 with ClinVar 1/2017 yields added, removed, and unchanged records. Relational data → simple set difference.
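A relational diff of this kind reduces to set operations, assuming each release can be loaded as a set of rows. The (variant, status) tuples below are illustrative, not real ClinVar records:

```python
# Sketch of a relational diff as plain set operations: added, removed, and
# unchanged rows between two releases. Row contents are illustrative.
def table_diff(old_rows, new_rows):
    return {
        "added": new_rows - old_rows,
        "removed": old_rows - new_rows,
        "unchanged": old_rows & new_rows,
    }

old = {("v1", "benign"), ("v2", "red")}
new = {("v1", "benign"), ("v3", "red")}
print(table_diff(old, new))
```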
  32. Binary SVI impact function. Returns True iff: known variants have moved in or out of Red status, new Red variants have appeared, or known Red variants have been retracted.
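All three conditions amount to asking whether the set of Red-status variants differs between the two versions, so a minimal sketch (with an assumed variant-to-status representation, not SVI's actual data model) can be:

```python
# Hypothetical binary impact check following the slide's three conditions:
# a difference in the Red-variant sets covers variants moving in/out of Red,
# newly appearing Red variants, and retracted Red variants.
def svi_impact(old, new):
    """old, new: dict mapping a variant id to a status string."""
    red = lambda d: {v for v, s in d.items() if s == "red"}
    return red(old) != red(new)

# A variant downgraded from red to amber triggers impact.
print(svi_impact({"v1": "red"}, {"v1": "amber"}))
```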
  33. Impact: a “semantic” example. Scope: which cases are affected? Individual variants have an associated phenotype, and patient cases also have a phenotype: “a change in variant v can only have impact on a case X if v and X share the same phenotype”. Importance: “any variant with status moving from/to Red causes high impact on any X that is affected by the variant”.
  34. Change impact analysis algorithm. Aim: to identify the minimal subset of observed changes that have an actual effect on past outcomes. This is done by progressively eliminating changes for which the impact has been estimated as null. Intuition: from the workflow, derive an impact graph, a new type of dataflow whose execution semantics is designed to propagate input changes, compute diff functions, compute impact functions on the diffs, and, when the impact is null, eliminate changes from the inputs. Input: a set of observed changes. Output: a set of bindings that indicates which changes are relevant and have non-zero impact on the process.
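The elimination step can be sketched as a filter over the observed change set, keeping only changes whose estimated impact is non-null. The change names and the impact predicate below are illustrative placeholders:

```python
# Illustrative sketch of the elimination step: the impact estimator is a
# caller-supplied predicate (standing in for the diff + impact functions),
# and changes whose estimated impact is null are dropped.
def relevant_changes(changes, has_impact):
    """changes: dict name -> (old, new); has_impact: fn(name, old, new) -> bool."""
    return {name: (old, new) for name, (old, new) in changes.items()
            if has_impact(name, old, new)}

# Hypothetical observed changes to two reference datasets.
changes = {"clinvar": ("v1", "v2"), "genemap": ("g1", "g2")}
only_clinvar = lambda name, old, new: name == "clinvar"
print(relevant_changes(changes, only_clinvar))
```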
  35. Role of provenance: impact facts. During each execution, ReComp records port-data bindings for all the data that flow through annotated input and output ports. Each impact function can use some of these bindings as its own inputs; these are the impact facts that the function is evaluated on. To find these bindings, traverse the dependencies from impact to diff functions.
  36. ReComp decision matrix for SVI, combining two signals: was a data diff detected by the delta functions, and was the impact assessed as yes, no, or not assessed?
  37. Empirical validation: re-executions reduced from 495 to 71. The ideal would be 14, but there are no false negatives.
  38. SVI implemented as a workflow: phenotype-to-genes, variant selection, and variant classification steps; the inputs are the patient variants and the phenotype; the reference data are GeneMap and ClinVar; the output is the set of classified variants.
  39. History DB: workflow provenance. Each invocation of an eSC workflow generates a provenance trace. Provenance pattern: the workflow WF is the “plan” and WFexec the “plan execution”; block executions B1exec and B2exec are partOf WFexec, with association, usage, and generation links connecting them to the data and to the reference database db.
  40. Partial re-execution. 1. Change detection: a provenance fact indicates that a new version Dnew of database db is available: wasDerivedFrom(“db”, Dnew). 2. Reacting to the change (example: db = “ClinVar v.x”), using the provenance pattern where WF is the “plan” and WFexec the “plan execution”. 2.1 Find the entry point(s) into the workflow where db was used: execution(WFexec), wasPartOf(Xexec, WFexec), used(Xexec, “db”). 2.2 Discover the rest of the sub-workflow graph (execute recursively): execution(WFexec), execution(B1exec), execution(B2exec), wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec), wasGeneratedBy(Data, B1exec), used(B2exec, Data).
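The recursive discovery in step 2.2 can be sketched in Python rather than Prolog: starting from the block executions that used the changed database, follow generation/usage edges forward to collect the downstream sub-workflow. The edge representation and execution names below are illustrative:

```python
# Sketch of sub-workflow discovery: a forward reachability walk over a
# consumer graph derived from wasGeneratedBy/used facts (exec -> executions
# that consume its generated data). Graph contents are illustrative.
def downstream_execs(entry_points, consumes):
    """entry_points: executions that used the changed data;
    consumes: dict exec -> set of executions using its outputs."""
    seen, stack = set(), list(entry_points)
    while stack:
        ex = stack.pop()
        if ex not in seen:
            seen.add(ex)
            stack.extend(consumes.get(ex, ()))
    return seen

# Hypothetical chain: B1exec feeds B2exec, which feeds B3exec.
consumes = {"B1exec": {"B2exec"}, "B2exec": {"B3exec"}}
print(downstream_execs(["B1exec"], consumes))
```

The resulting set is the minimal workflow fragment to re-execute in response to the change.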
  41. SVI: partial re-execution. Overhead: caching intermediate data. Time savings: a change in GeneMap takes 325 s for a partial re-execution vs 455 s for a complete one (28.5% saved); a change in ClinVar takes 287 s vs 455 s (37% saved). Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
  42. Architecture. The ReComp Core runs the ReComp loop: it reacts to change events, constructs restart trees, and stores/retrieves process and data provenance as Prolog facts in the History DB (a ProvONE store). Pluggable external services sit behind REST APIs: a Tabular-Diff Service (difference function) behind the Diff Service interface, an Impact Service (impact function) behind the Impact Service interface, and a Re-Execution Service (re-execution function) behind the ReExec Service interface, which executes restart trees against the user process in its runtime environment, with its inputs and outputs.
  43. Customising ReComp in practice: enable provenance capture and map it to PROV.
  44. Summary. A generic framework: from black box to gray box; fine-grained provenance plus control yields maximum savings. Evaluation is on a case-by-case basis: cost savings and ease of customisation. Tested on two case studies: genomics and flood simulation (not in this talk).
