
Analytics of analytics pipelines: from optimising re-execution to general Data Provenance for Data Science

An invited talk (thanks!) given to the Office for National Statistics in March 2021

  1. Title: Analytics of analytics pipelines: from optimising re-execution to general Data Provenance for Data Science. Paolo Missier, School of Computing, Newcastle University, UK, March 2021. paolo.missier@ncl.ac.uk | LinkedIn: paolomissier | Twitter: @PMissier
  2. Outline (ONS, March 2021, P. Missier): (1) ReComp, a framework to enable the selective re-computation of expensive analytics workflows; (2) Data Provenance for Data Science.
  3. Context: Big Data feeds the Big Analytics Machine, which produces actionable knowledge. Data science evolves over time: meta-knowledge, algorithms, tools, libraries, and reference datasets all move through successive versions (V1, V2, V3, ...).
  4. What changes? In life sciences and health care: reference databases, algorithms, and libraries. In simulation: large parameter spaces and input conditions. In machine learning: evolving ground-truth datasets and model re-training.
  5. Motivating example: genomics pipelines. Spark GATK tools on Azure run at about 45 minutes per GB; at 13 GB per exome, that is roughly 10 hours. Image credits: Broad Institute, https://software.broadinstitute.org/gatk/ and https://www.genomicsengland.co.uk/the-100000-genomes-project/
  6. Genomics: WES/WGS, variant calling, then variant interpretation. SVI (Simple Variant Interpretation) is a simple single-nucleotide human variant interpretation tool for clinical use, classifying variants as pathogenic, benign, or unknown/uncertain. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
  7. Changes that affect variant interpretation: improved sequencing and variant calling; ClinVar and OMIM evolve rapidly; new reference data sources appear. The number of variants that affect patients evolves both (a) for a specific phenotype and (b) across all phenotypes.
  8. Blind reaction to change is a game of battleship. Sparsity issue: about 500 executions over 33 patients, roughly 60 hours of total runtime, yet only 14 relevant output changes detected: 4.2 hours of computation per change. Should we care about every update to evolving knowledge about gene variations?
  9. Reacting to changes in inputs: a function f() maps inputs x1, x2 (with dependencies d1, d2) to output y1. Option 1: always refresh. Option 2: approximate.
  10. Option 3: a refresh-if-needed approach.
  11. When f(.) is unstable, use heuristics. Scope (which cases are affected?): "a change in variant v can only have impact on a case X if v and X share the same phenotype". Impact: "any variant with status moving from/to Red causes high impact on any patient who is affected by the variant". Observation: variants v within output set y that are in scope for patient X remain in scope (monotonicity). Two kinds of change: (1) variant v changes status (unknown to benign, unknown to deleterious); (2) a brand-new variant appears. If the variant is in scope, comparing its status before and after is inexpensive; otherwise, recomputing SVI on all inputs is expensive.
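The scope and impact rules above can be sketched as a cheap filter over past cases. This is an illustrative reconstruction, not the actual SVI code; the function names, the dictionary fields, and the "Red" status set are assumptions made for the example.

```python
# Sketch of the refresh-if-needed heuristic from the slide.
# All names (in_scope, high_impact, cases_to_recompute, the record fields)
# are hypothetical, not the real SVI implementation.

RED = {"pathogenic", "deleterious"}  # the "Red" statuses in the traffic-light scheme

def in_scope(variant, case):
    # Scope rule: a change in variant v can only impact case X
    # if v and X share the same phenotype.
    return variant["phenotype"] == case["phenotype"]

def high_impact(old_status, new_status):
    # Impact rule: any status move from/to Red is high impact.
    return (old_status in RED) != (new_status in RED)

def cases_to_recompute(change, cases):
    """Return only the cases whose output may change, avoiding blind re-runs."""
    affected = []
    for case in cases:
        if not in_scope(change["variant"], case):
            continue                      # cheap filter: out-of-scope cases are safe
        if change["kind"] == "new_variant" or high_impact(
                change["old_status"], change["new_status"]):
            affected.append(case["id"])   # only these need SVI re-executed
    return affected
```

For example, a variant moving from "unknown" to "pathogenic" triggers re-computation only for patients sharing its phenotype, while an "unknown" to "benign" move (no Red boundary crossed) triggers none.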
  12. Empirical evaluation: re-executions reduced from 495 to 71. The ideal is 14, but the heuristic produces no false negatives.
  13. ReComp (http://recomp.org.uk/): a generic, customisable framework for selective re-computation. Scope: expensive analyses with frequent data changes, where not all changes are significant. Challenge: make re-computation efficient in response to changes. Assumptions: processes are observable and reproducible, and estimates are cheap. Insight: replace re-computation with change impact estimation, using the history of past executions.
  14. Data provenance in ReComp. Hypothesis: collecting detailed provenance and logs from past executions helps optimise future executions, by (1) identifying the subset of executions that are potentially affected by the changes, and (2) identifying and re-executing the minimal fragments of workflow that have been affected.
  15. Reproducibility: how, why, when, and to what extent. Selectivity operates at two levels: across a cohort of past executions (which subset of individuals?) and within a single re-execution (which process fragments?), e.g. for a change in ClinVar or in GeneMap.
  16. The ReComp meta-process. Change events to reference datasets or inputs are detected and quantified via data diff(d, d′) functions. The execution history of the analytics process P, including logs and provenance, is recorded in a History DB. For each past instance, impact estimation functions compute Impact(d→d′, o) to scope the relevant cases; the relevant sub-processes are then selected and partially re-executed, turning P(D) into P′(D′), with optimisation.
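The meta-process described above can be sketched as a small driver loop. This is a minimal, hypothetical rendering of the idea, not the ReComp implementation: the diff and impact functions are user-supplied plug-ins, exactly as the slide's customisation points suggest.

```python
# Minimal sketch of the ReComp meta-process loop (illustrative only).
# `diff`, `impact`, and `re_execute` stand in for the framework's
# customisable components; the History DB is a plain dict here.

def recomp_loop(change_event, history, diff, impact, re_execute, threshold=0.0):
    """For each past execution, estimate the impact of a data change and
    re-execute only the runs whose estimate exceeds a threshold."""
    delta = diff(change_event["old"], change_event["new"])   # quantify the change
    refreshed = {}
    for run_id, record in history.items():                   # scan the History DB
        score = impact(delta, record["output"])              # cheap impact estimate
        if score > threshold:                                # refresh-if-needed
            refreshed[run_id] = re_execute(record)
    return refreshed
```

A toy instantiation: with diff returning the set of added reference entries and impact counting how many of them appear in a run's output, only runs that actually consumed the changed entries are refreshed.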
  17. How much do we know about the process? Impact estimation and re-execution depend on process structure and execution trace:
     - Black box: only I/O provenance is available; impact estimation is all-or-nothing and re-execution is complete. Typical: a monolithic or legacy process, e.g. a complex simulator.
     - White box: step-by-step provenance is available; impact can be estimated at fine grain and re-execution can be partial, via restart trees (*). Typical: workflows, R/Python code, e.g. genomics analytics.
     (*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018. London: Springer; 2018.
  18. Provenance of process executions: a process or workflow run is an Activity, its data products are Entities, and a plan plays a role in an association. By analogy: a chef (agent) wasAssociatedWith a cooking recipe (plan), and the finished dish wasGeneratedBy the activity.
  19. Execution trace / provenance: UML class diagram of the provenance model, relating User, Execution, Program, Workflow, Channel, and Port (with hasInPort/hasOutPort, hasSubProgram, connectsTo, controls/controlledBy, wasPartOf) through PROV relations such as wasAssociatedWith, wasGeneratedBy, used, wasDerivedFrom, wasInformedBy, and their qualified forms (Usage, Generation, Association, with hadPlan and hadEntity).
  20. History DB: workflow provenance. Each invocation of a workflow generates a provenance trace connecting the "plan" (workflow WF with programs B1, B2) to the "plan execution" (WFexec with B1exec, B2exec, related by partOf), with usage and generation links to data, usage links to reference data entities (db), and association links between each execution and its plan.
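The trace on the slide can be sketched as a set of PROV-style relations recorded per invocation, together with the kind of query the meta-process needs (which runs used a given reference dataset?). Identifiers and the triple encoding are illustrative, not the actual ReComp History DB schema.

```python
# Toy History DB entry for one workflow invocation, using the PROV-style
# relations from the slide (association, partOf, usage, generation).

def record_invocation(history_db, wf, blocks, inputs, outputs, ref_data):
    exec_id = f"{wf}exec{len(history_db)}"
    trace = [("association", exec_id, wf)]                 # plan execution -> plan
    for b in blocks:
        b_exec = f"{b}exec"
        trace.append(("partOf", b_exec, exec_id))          # step belongs to the run
        trace.append(("association", b_exec, b))           # step execution -> program
    trace += [("usage", exec_id, d) for d in inputs + ref_data]
    trace += [("generation", d, exec_id) for d in outputs]
    history_db[exec_id] = trace
    return exec_id

# A query the meta-process needs: which past runs used a given reference dataset?
def runs_using(history_db, dataset):
    return [e for e, t in history_db.items() if ("usage", e, dataset) in t]
```

With such a store, a change event to, say, ClinVar immediately yields the subset of executions worth considering for re-computation.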
  21. SVI implemented as a workflow: the phenotype drives a phenotype-to-genes step (using GeneMap), followed by variant selection over the patient's variants and variant classification (using ClinVar), producing the classified variants.
  22. SVI: partial re-execution. Overhead: caching intermediate data. Time savings:

     | Change in | Partial re-exec (sec) | Complete re-exec (sec) | Time saving (%) |
     |-----------|-----------------------|------------------------|-----------------|
     | GeneMap   | 325                   | 455                    | 28.5            |
     | ClinVar   | 287                   | 455                    | 37              |

     Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
  23. ReComp summary. The generic framework observes changes, maintains a provenance DB (the History), and controls re-execution; customisation happens through user-supplied diff functions and impact functions. Evaluation is on a case-by-case basis: cost savings and ease of customisation. Fine-grained provenance plus fine-grained control yields maximum savings.
  24. Part 2: Data Provenance for Data Science.
  25. Data, model, predictions. A pipeline moves from data collection (raw datasets, instances) through pre-processing (features) to a model that predicts you: a ranking, a score, a class. Key decisions are made during data selection and processing: where does the data come from? What is in the dataset? What transformations were applied? Complementing current ML approaches to model interpretability: (1) can we explain these decisions? (2) Are these explanations useful?
  26. Explaining data preparation. Data acquisition and wrangling raise their own questions: how were the datasets acquired, and how recently? For what purpose? Are they being reused or repurposed? What is their quality? Pre-processing covers integration, cleaning, outlier removal, normalisation, feature selection, class rebalancing, sampling, stratification, and more, implemented as scripts (Python/TensorFlow, Pandas, Spark) or workflows (Knime, ...). Provenance enables transparency.
  27. Typical operators used in data prep.
  28. Recent early results from a small grassroots project [1]: formalisation of provenance patterns for pipeline operators; systematic collection of fine-grained provenance from (nearly) arbitrary pipelines; and a reality check: how much does it cost (provenance volume), and does it help (queries against the provenance database)? [1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4):507-520, January 2021.
  29. Operators fall into three groups: data reduction (feature selection, instance selection); data augmentation (space transformation, instance generation, encoding, e.g. one-hot); and data transformation (data repair, binarisation, normalisation, discretisation, imputation). Example: vertical augmentation, i.e. adding columns.
  30. Code instrumentation: initialise provenance capture once, then create a "provlet" for each specific transformation. The required code injection is now being automated.
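The instrumentation idea can be sketched with a decorator that wraps a preprocessing step and emits a provlet per call. This is a hypothetical stand-in, not the library from the PVLDB'21 paper: all names and the provlet fields are invented for illustration, and tables are plain lists of dicts rather than DataFrames to keep the sketch self-contained.

```python
# Sketch: wrap a preprocessing operator so each call emits a "provlet",
# a small provenance fragment describing what the operator did.
# Names are hypothetical, not the actual instrumentation library.

PROVLETS = []                      # stands in for "initialise provenance capture"

def capture_provenance(op_name):
    def wrap(fn):
        def instrumented(table, *args, **kwargs):
            cols_in = set(table[0]) if table else set()
            out = fn(table, *args, **kwargs)
            cols_out = set(out[0]) if out else set()
            PROVLETS.append({
                "operator": op_name,
                "used_columns": sorted(cols_in),
                "new_columns": sorted(cols_out - cols_in),  # vertical augmentation
                "rows_in": len(table), "rows_out": len(out),
            })
            return out
        return instrumented
    return wrap

@capture_provenance("one-hot encoding")
def one_hot(table, col):
    # Replace a categorical column with one 0/1 column per observed value.
    values = sorted({row[col] for row in table})
    return [{**{k: v for k, v in row.items() if k != col},
             **{f"{col}_{v}": int(row[col] == v) for v in values}}
            for row in table]
```

Because the wrapper sees both input and output, the provlet records exactly the kind of fact needed later: one-hot encoding is vertical augmentation, so `new_columns` is non-empty while row counts are unchanged.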
  31. Provenance patterns.
  32. Provenance templates: template + binding rules = instantiated provenance fragment. An operator op maps {old values: F, I, V} to {new values: F′, J, V′}.
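The "template + binding rules" equation can be sketched directly: a template is a provenance fragment with placeholder variables, and a binding substitutes concrete identifiers from one operator invocation. This is a simplified stand-in for the paper's template mechanism; the `?var` convention and triple encoding are assumptions.

```python
# Sketch of provenance-template instantiation: substituting a binding
# into a fragment of placeholder triples yields a concrete fragment.

TEMPLATE = [
    ("used", "?op", "?old_entity"),
    ("wasGeneratedBy", "?new_entity", "?op"),
    ("wasDerivedFrom", "?new_entity", "?old_entity"),
]

def instantiate(template, binding):
    """Replace each placeholder term with its bound value, if any."""
    subst = lambda term: binding.get(term, term)
    return [tuple(subst(t) for t in triple) for triple in template]
```

For a hypothetical one-hot step, binding `?op`, `?old_entity`, and `?new_entity` to the invocation's identifiers yields the instantiated fragment, including the wasDerivedFrom link between the new and old dataset versions.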
  33. This applies to all operators.
  34. Putting it all together.
  35. Evaluation: performance.
  36. Evaluation: provenance capture and query times.
  37. Scalability.
  38. Summary. Multiple hypotheses regarding Data Provenance for Data Science: (1) Is it practical to collect fine-grained provenance? To what extent can it be done automatically, and how much does it cost? (2) Is it also useful? What is the benefit to data analysts? Work in progress! Interest? Ideas?
