ReComp: challenges in selective recomputation of (expensive) data analytics tasks

Talk given to the Centre for Doctoral Training students in Cloud Computing for Big Data Analytics, School of Computing Science, Newcastle University

Published in: Technology

  1. Center for Doctoral Training, Newcastle. Seminar Series, Nov. 2015. P. Missier. ReComp: preserving the value of big data insights over time. Panta Rhei (Heraclitus, through Plato) (*). Paolo Missier, Paolo.Missier@ncl.ac.uk. November 2015, Cloud CDT seminar series, Newcastle. (*) Painting by Johannes Moreelse.
  2. Generating analytical knowledge. Stages: Specification, Deployment, Execution, producing a KA (KA = Knowledge Asset). Dependencies: algorithms, libs, packages; system; external state (DBs); input data; config. Ex.: machine learning using Python and scikit-learn to learn a model that recognises an activity pattern. Diagram labels: Python 3; Ubuntu x.y.z; Azure VM; model training; model. Dependencies in this example: scikit-learn, NumPy, Pandas; Ubuntu on Azure; training + testing dataset; config.
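A minimal sketch of how the dependencies listed on this slide might be recorded alongside a knowledge asset. The record layout and the function names `package_version` and `describe_dependencies` are hypothetical and not taken from the talk; the input data and config values are placeholders.

```python
import hashlib
import json
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def describe_dependencies(packages, input_data, config):
    """Build a dependency record for one knowledge asset (hypothetical schema)."""
    return {
        "python": platform.python_version(),
        "system": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        # Content hashes let a later run detect that the inputs have changed.
        "input_sha256": hashlib.sha256(input_data).hexdigest(),
        "config_sha256": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode("utf-8")
        ).hexdigest(),
    }


record = describe_dependencies(
    packages=["scikit-learn", "numpy", "pandas"],
    input_data=b"placeholder training and testing data",
    config={"test_fraction": 0.2},
)
print(json.dumps(record, indent=2))
```

Comparing two such records taken at different times is one simple way to notice that a library, the system, or the input data has changed since the asset was computed.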
  3. Generating analytical knowledge. Stages: Specification, Deployment, Execution, producing a KA (KA = Knowledge Asset). Dependencies: algorithms, libs, packages; system; external state (DBs); input data; config. Ex.: workflow to identify mutations in a patient’s genome. Diagram labels: workflow specification; WF manager; Linux VM cluster on Azure; analyse; input genome; variants. Dependencies in this example: GATK/Picard/BWA; workflow manager (and its own dependencies); Ubuntu on Azure; input genome; config; ref genome; variants DBs.
  4. What changes, and how frequently? Diagram labels: rate of change, from long-lived and slow-changing to short-lived and fast-changing; input data; external DBs; data streams; historical time series data; current Twitter graph; reference DBs.
  5. How fast does knowledge advance? • Life Sciences knowledge: • Genes (GenBank, Ensembl), Proteins, SNPs, Human Variants DBs (ClinVar) • Life Sciences ontologies (GO, HPO, …) • The human genome assembly • The collection of all PubMed articles • DBPedia, Wikipedia, etc. • All current {Twitter, FB, G+, …} users and their connections • A map of all buildings in a city, with their location and footprint • The Hubble Atlas of Ancient Galaxies • The catalogue of all known Exoplanets (about 2000)
  6. How do we know which changes are relevant?
  7. What analytics? Genomics: • Diagnosis of rare genetic diseases • Analyse soil, water composition (metagenomics). Social media analytics, e.g. Twitter content analysis: • Sentiment analysis • Topic discovery • Emergency response • Fostering new communities. Climate modelling: • Predicting local climate changes • Ecology: understanding change by monitoring local species. Environment risk assessment: • Flood modelling and simulation
  8. Case study: NGS data processing pipeline (Genomics). Pipeline steps:
     • Align: aligns sample sequence to HG19 reference genome using BWA aligner
     • Clean: cleaning, duplicate elimination (Picard tools)
     • Recalibrate alignments: corrects for system bias on quality scores assigned by sequencer (GATK)
     • Calculate coverage: computes coverage of each read
     • Call variants: variant calling operates on multiple samples simultaneously; splits samples into chunks; Haplotype caller detects both SNVs and longer indels
     • Recalibrate variants: attempts to reduce the false positive rate from the caller
     • Filter variants: VCF subsetting by filtering, e.g. non-exomic variants
     • Annotate: Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
     Data flow: raw sequences → align → clean → recalibrate alignments → calculate coverage (→ coverage information) → call variants → recalibrate variants → filter variants → annotate → annotated variants. The diagram groups the steps into Stage 1, Stage 2 and Stage 3, with the per-sample branches drawn once per sample.
  9. Case study: metagenomics. From environment to DNA sequence. Diagram labels: sample; size fractioning; DNA extraction; mRNA extraction; PCR; sequencing; amplicon; metatranscriptome; metagenome; analysis?
  10. Case study: flood modelling in Newcastle. CityCAT (City Catchment Analysis Tool): a unique software tool for modelling, analysis and visualisation of surface water flooding • High resolution flood model • Integrates hydraulic modelling algorithms • Subsurface flow modelling • Topography (DEMs from LIDAR) • Physical structures (buildings etc.) • Landuse data • Outputs high resolution grid of flood depths • Extensively tested • Multi-platform • Integrated into CONDOR and Microsoft Azure
  11. What kind of changes affect these analytics tasks? Table columns: Application; Knowledge; Algorithms and tools. Where the column boundary is unclear in the transcript, the cell text is kept in its original order.
     • LS Diagnosis of rare genetic diseases. Knowledge: PubMed; Human Variants DBs; the human genome assembly; SNP DBs. Algorithms and tools: numerous algorithms and tools used for sequence alignment, cleaning, variant calling…
     • LS Metagenomics. Knowledge: collections of known DNA sequences for multiple species. Algorithms and tools: same as for genomics
     • SM Sentiment analysis: Past; Predictive models; Content analysis; NLP tools; Statistical model learning (classification)
     • SM Topic discovery: Clustering algorithms
     • SM Emergency response: Content analysis; NLP tools; Predictive models, topical trend analysis
     • SM Fostering new communities: Hubs & authorities algorithms, clustering
     • CS Predicting local climate changes. Knowledge: historical and current time series at multiple resolution; past and current models. Algorithms and tools: statistical model learning
     • CS Ecology: understand change by monitoring local species. Knowledge: local species count & behaviour observations. Algorithms and tools: statistical model learning
     • CE Flood modelling and simulation. Knowledge: local topography, location of buildings. Algorithms and tools: simulation packages (e.g. CityCAT)
  12. Volume: how many data products are affected? Table columns: Application; Volume.
     • LS Diagnosis of rare genetic diseases: 100K genome project in the UK alone; thousands of samples in Newcastle alone
     • LS Metagenomics: a few K (EBI Metagenomics portal)
     • SM Sentiment analysis: # of users whose sentiment is being analysed
     • SM Topic discovery: a few clusters, containing a large number of Tweets
     • SM Emergency response: a few key decisions
     • SM Fostering new communities: a few key users
     • CS Predicting local climate changes: local effect
     • CS Ecology: understand change by monitoring local species: local effects
     • CE Flood modelling and simulation: local effects
  13. How fast do these products become obsolete? Time scale: minutes, hours, days, months, years. Applications compared: LS Diagnosis of rare genetic diseases; LS Metagenomics; SM Sentiment analysis; SM Topic discovery; SM Emergency response; SM Fostering new communities; CS Predicting local climate changes; CS Ecology: monitoring local species; CE Flood modelling and simulation.
  14. How sensitive are data products to change? Applications compared: LS Diagnosis of rare genetic diseases; LS Metagenomics; SM Sentiment analysis; SM Topic discovery; SM Emergency response; SM Fostering new communities; CS Predicting local climate changes; CS Ecology: monitoring local species; CE Flood modelling and simulation.
  15. How much do they cost? Note: cost per product vs cost over all products. Cost components: design; development; system; runtime. Applications compared: LS Diagnosis of rare genetic diseases; LS Metagenomics; SM Sentiment analysis; SM Topic discovery; SM Emergency response; SM Fostering new communities; CS Predicting local climate changes; CS Ecology: monitoring local species; CE Flood modelling and simulation.
  16. Case study: NGS data processing pipeline (Genomics). (Same content as slide 8.)
  17. Workflow deployment on the Azure cloud. Diagram labels: web browser; rich client app; Web UI; REST API; e-Science Central main server (Azure VM); JMS queue; e-SC db backend (Azure VM); e-SC blob store; Azure Blob store; three workflow engines (worker roles); flows: workflow invocations, e-SC control data, workflow data. Module configuration: 3 nodes, 24 cores. Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD, Ubuntu 14.04.
  18. Cost. [Chart: cost in GBP (0 to 18) against number of samples (0 to 24), with one series each for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores).]
  19. Changes in reference knowledge (ClinVar DB)
  20. Case study: the Metagenomics portal at the EBI
  21. From environment to DNA sequence. Diagram labels: sample; size fractioning; DNA extraction; mRNA extraction; PCR; sequencing; amplicon; metatranscriptome; metagenome; analysis?
  22. EBI metagenomics portal • Open resource for the archiving and analysis of metagenomics and metatranscriptomics • Generic, yet standardised analysis platform for all metagenomics studies • Offers a service that small groups would struggle to achieve • Submission of sequence data for archiving and analysis • Data analysis using selected EBI and external software tools • Data presentation and visualisation through web interface
  23. Pipeline overview
  24. Marine datasets: the portal contains over 30 marine metagenomes. [Chart axis: millions of sequences.]
  25. Case study: flood modelling in Newcastle. (Same content as slide 10.)
  26. Fusing UO data and modelling. Diagram labels: CityCAT flood model; traffic data; weather data.
  27. The ReComp project. Aims: to create a decision support system for (1) detecting changes that affect time-sensitive analytical knowledge, (2) assessing its reprocessing options, and (3) estimating their cost. Diagram labels: change events; utility functions; priority rules; previously computed KAs and their metadata; ReComp DSS; prioritised KAs; cost estimates; reproducibility assessment. Funded by the EPSRC (Making sense from data), Feb. 2016 to Feb. 2019, 2 Research Associates. In collaboration with: Newcastle Civil Engineering (Phil James); Department of Clinical Neurosciences, Cambridge University (Prof. Patrick Chinnery).
  28. ReComp: target operating region. [Charts with axes rate of change (slow to fast), volume (low to high) and cost (low to high), each marking the ReComp target region.]
  29. Recomputation analysis: abstraction. [Diagram: knowledge assets KA1 to KA5, each depending on a subset of the assets a, b, c, d; change events a → a’, b → b’, c → c’ occur at times t1, t2, t3 and are matched against the assets that depend on them.]
  30. Recomputation analysis: conceptual steps. Assume we have a growing universe KA of Knowledge Assets. Each ka ∈ KA has dependencies dep(ka) on other assets in a set DA (input data, algorithms, libs…). ReComp analysis steps:
     • Monitor and detect relevant change events {da_i → da_i’} with da_i ∈ DA
     • For each change event {da_i → da_i’}:
        • Identify the candidate recomp population ka_rec ⊆ KA: all ka ∈ KA such that da_i ∈ dep(ka)
        • For each ka ∈ ka_rec:
           • Estimate the effect of recomputing ka using da_i’ instead of da_i (quantitative estimation of the impact due to the change da_i → da_i’)
           • Determine the time and cost associated with recomputing ka
        • Use these estimates along with utility functions to rank ka_rec
        • Carry out top-k recomputations given a budget: ka → ka’
        • Perform post-hoc analysis to improve the estimation models:
           • Compare actual effect with estimates
           • Differential data analysis: Δ(ka, ka’)
           • Change cause analysis: has any other element contributed to Δ(ka, ka’)?
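The candidate selection and ranking steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the project's method: the impact and cost estimates are taken as given numbers, the utility function (impact per unit of cost) is hypothetical, and a greedy pass over the ranked list stands in for "top-k given a budget". Costs are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeAsset:
    name: str
    deps: frozenset      # dep(ka): names of the assets in DA that ka depends on
    est_impact: float    # estimated effect of recomputing (assumed given)
    est_cost: float      # estimated cost of recomputing (assumed given, > 0)


def candidates(assets, changed_dep):
    """ka_rec: every ka whose dependency set contains the changed asset."""
    return [ka for ka in assets if changed_dep in ka.deps]


def utility(ka):
    """Hypothetical utility function: estimated impact per unit of cost."""
    return ka.est_impact / ka.est_cost


def select_recomputations(assets, changed_dep, budget):
    """Rank candidates by utility, then pick greedily while the budget lasts."""
    ranked = sorted(candidates(assets, changed_dep), key=utility, reverse=True)
    chosen, spent = [], 0.0
    for ka in ranked:
        if spent + ka.est_cost <= budget:
            chosen.append(ka)
            spent += ka.est_cost
    return chosen


# Toy population; names, dependencies and estimates are made up.
assets = [
    KnowledgeAsset("KA1", frozenset({"a", "b", "c"}), est_impact=0.9, est_cost=4.0),
    KnowledgeAsset("KA2", frozenset({"a", "b", "d"}), est_impact=0.2, est_cost=1.0),
    KnowledgeAsset("KA3", frozenset({"b", "c", "d"}), est_impact=0.8, est_cost=2.0),
    KnowledgeAsset("KA4", frozenset({"a", "c"}), est_impact=0.6, est_cost=2.0),
]

selected = select_recomputations(assets, changed_dep="a", budget=5.0)
print([ka.name for ka in selected])  # ['KA4', 'KA2']
```

KA3 is never considered because it does not depend on `a`; KA1 ranks second but is skipped because it would exceed the budget once KA4 has been chosen.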
  31. Recomputation analysis through sampling. Diagram labels: change events; monitor; identify recomp candidates; assess effects of change; estimate recomp cost; assess reproducibility cost; prioritisation (budget, utility); sampling; small-scale recomp; large-scale recomp; Meta-K.
  32. Recomputation analysis through modelling. Diagram labels: change events; identify recomp candidates; target population; estimate change impact (Change Impact Model); estimate reproducibility cost/effort (Cost Model); prioritisation (utility, budget); large-scale recomp; model updates. Change impact model: Δ(x, x’) → Δ(y, y’). Challenging! Can we do better?
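To make the notation Δ(x, x’) → Δ(y, y’) concrete, here is a deliberately simple sketch. It assumes the changed input x is a key-value reference database and that metadata records which keys each past output consulted; both assumptions, the function names, and the toy data are hypothetical. It only flags which outputs might change, not by how much, which is the hard part the slide points at.

```python
def delta(old, new):
    """Δ(x, x'): keys added, removed, or changed between two dict versions."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    changed = {k for k in set(old) & set(new) if old[k] != new[k]}
    return added | removed | changed


def affected_outputs(outputs, changed_keys):
    """Coarse estimate of Δ(y, y'): outputs that used at least one changed key."""
    return {name for name, used in outputs.items() if used & changed_keys}


# Two versions of a toy reference database (made-up entries).
db_v1 = {"var1": "benign", "var2": "uncertain", "var3": "pathogenic"}
db_v2 = {"var1": "benign", "var2": "pathogenic", "var4": "benign"}

# Which reference entries each past result consulted (hypothetical metadata).
outputs = {
    "patient_A": {"var1"},
    "patient_B": {"var2", "var3"},
    "patient_C": {"var1", "var3"},
}

changed = delta(db_v1, db_v2)
print(sorted(changed))                             # ['var2', 'var3', 'var4']
print(sorted(affected_outputs(outputs, changed)))  # ['patient_B', 'patient_C']
```

One limitation of this approach is visible in the example: a newly added entry such as `var4` cannot appear in any past output's record, so outputs that would now match it are missed.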
  33. Metadata + Analytics. The knowledge is in the metadata! Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations. Diagram labels: change events; identify recomp candidates; estimate change impact (Change Impact Model); estimate reproducibility cost/effort (Cost Model); large-scale recomp; model updates; Meta-K: • Logs • Provenance • Dependencies
  34. High level architecture. Diagram labels: ReComp decision dashboard (Execute; Curate; Select/prioritise); prospective provenance curation (YesWorkflow); Meta-Knowledge Repository; Research Objects; Change Impact Analysis; Cost Estimation; Differential Analysis; Reproducibility Assessment; domain knowledge (utility functions, priorities policies, data similarity functions); for Python and for other analytics environments: runtime monitor, logging, runtime provenance recorder; metadata collected: provenance, logs, data and process versions, process dependencies; WP1.
  35. Project objectives. Obj 1: to investigate analytics techniques aimed at supporting re-computation decisions. Obj 2: to research techniques for assessing under what conditions it is practically feasible to re-compute an analytical process. • Specific target system environments: • Python / Jupyter • The e-Science Central workflow manager (developed at Newcastle). Obj 3: to create a decision support system for the selective recomputation of complex data-centric analytical processes and demonstrate its viability on two target case studies: • Genomics (human variant analysis) • Urban Observatory (flood modelling)
  36. Expected outcomes. Research outcomes: algorithms that operate on metadata to perform: • impact analysis • cost estimation • differential data and change cause analysis of past and new knowledge outcomes • estimation of reproducibility effort. System outcomes: • A software framework consisting of domain-independent, reusable components, which implement the metadata infrastructure and the research outcomes • A user-facing decision support dashboard. It must be possible to integrate the framework with domain-specific components, to support specific scenarios, exemplified by our case studies.
  37. Challenge 1: estimating impact and cost. Diagram labels: target population; estimate change impact (Change Impact Model); estimate reproducibility cost/effort (Cost Model); prioritisation; large-scale recomp; model updates. Change impact model: Δ(x, x’) → Δ(y, y’). Challenging!
  38. Challenge 2: managing the metadata. How do we generate / capture / store / index / query across multiple metadata types and formats? Relevant metadata: • Logs of past executions, automatically collected • Provenance traces: • Runtime (“retrospective”) provenance: automatically collected data dependency graph captured from the computation • Process structure (“prospective provenance”): obtained by manually annotating a script • External data and system dependencies, process and data versions, and system requirements
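As an illustration of querying a retrospective provenance trace, the sketch below stores a data dependency graph as plain pairs and asks which entities are downstream of a changed one. The pairs are written in the spirit of PROV's wasDerivedFrom relation, but the storage format, the entity names, and the function `downstream` are assumptions made for this example.

```python
from collections import defaultdict, deque

# Each tuple is (derived entity, source entity). Entity names are hypothetical.
derivations = [
    ("alignments", "raw_sequences"),
    ("alignments", "reference_genome"),
    ("variants", "alignments"),
    ("annotated_variants", "variants"),
    ("annotated_variants", "variants_db"),
]


def downstream(derivations, changed):
    """Return every entity derived, directly or transitively, from `changed`."""
    derived_from = defaultdict(set)
    for derived, source in derivations:
        derived_from[source].add(derived)
    seen, queue = set(), deque([changed])
    while queue:  # breadth-first walk along the derivation edges
        for nxt in derived_from[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(downstream(derivations, "reference_genome")))
# ['alignments', 'annotated_variants', 'variants']
print(sorted(downstream(derivations, "variants_db")))
# ['annotated_variants']
```

In this toy trace a change to the reference genome invalidates everything after alignment, while a change to the variants database only touches the final annotation step.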
  39. Challenge 3: Reproducibility. Ex.: workflow to identify mutations in a patient’s genome. Diagram labels: workflow specification; WF manager; Linux VM cluster on Azure; analyse; input genome; variants. Dependencies: GATK/Picard/BWA; workflow manager (and its own dependencies); Ubuntu on Azure; input genome; config; ref genome; variants DBs. What happens when any of the dependencies change?
  40. Challenge 4: reusability of the solution across cases • How do we make case-specific solutions generic? • How do we make the DSS reusable? • Refactor: generic framework + case-specific components • This is hard: most elements are case-specific! • Metadata formats • Metadata capture • Change impact • Cost models • Utility functions • …
  41. Available technology components • W3C PROV model for describing data dependencies (provenance) • DataONE “metacat” for data and metadata management • The e-Science Central Workflow Management System • Natively provenance-aware • NoWorkflow: an (experimental) Python provenance recorder • Cloud resources: • Azure, our own private cloud (CIC). (The slide also repeats the high level architecture diagram from slide 34.)
  42. Specific areas for PhD research. Modelling and analytics: • Impact and cost estimation • […]. Software engineering: • Generic framework + plugins architecture • Metadata management • Capture, storage, index, query • Reproducibility for recomputation • […]. Case studies: • Genomics • Flood modelling / smart cities • […]
  43. Summary • Value from Big Data analytics may decay as the resources it is built on change • Resources = {data, external state, algorithms, libs, …} • Value = “Knowledge Assets” (KA) • When should such value be restored? • How do you estimate the cost of re-computation? • How do you prioritise over a large pool of KA for a given budget? ReComp: • A decision support tool aimed at answering these questions • Through a metadata management infrastructure with metadata analytics on top