RAPIDS 2018 - Keynote - How I learned to stop worrying and love version control

The keynote talk from RAPIDS 2018 in London.

Dr Stephen J Newhouse and Luke Marsden explain why now is the moment to take Reproducibility and Provenance in Data Science (RAPIDS) seriously, and how this can be achieved with process and tooling.

Stephen shares his experiences of the challenges in the industry and Luke introduces the beta version of Dotscience, a tool for model tracking and collaboration through RAPIDS.

  1. 1. Welcome & Today’s Schedule 09:00 Registration and breakfast · 10:00 How I learned to stop worrying and love version control – Dr Stephen J Newhouse and Luke Marsden · 10:40 Effective computing for research reproducibility – Dr Laura Fortunato · 11:20 Morning Break · 11:40 A crazy little thing called reproducible science – Dr Tania Allard · 12:20 Machine Learning in Production - A practical approach to continuous deployment of Machine Learning pipelines – Luca Palmieri & Christos Dimitroulas · 13:00 Lunch · 14:00 – 17:30 Version Control for your Model, Data and Environment – Workshop · 18:00 Networking drinks #rapids2018 @getdotmesh
  2. 2. #rapids2018 @getdotmesh
  3. 3. Thank you! Please tweet! #rapids2018 @getdotmesh
  4. 4. How I learned to stop worrying and love version control Steve Newhouse & Luke Marsden #rapids2018 @getdotmesh
  5. 5. Who am I? Dr. Stephen J Newhouse Lead Data Scientist & Senior Bioinformatician ⇢ KCL Department of Biostatistics and Health Informatics ⇢ NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust ⇢ UCL Institute of Health Informatics & Health Data Research (HDR) UK #rapids2018 @getdotmesh
  6. 6. Our (broad) interests... ⇢ From Bench (to Computer) to Bedside… ⇢ Collaborative & Open Research… ⇢ Data & knowledge sharing… ⇢ DevOps applied to Health Care/Research ⇢ Personalised Medicine & the Quantified Self #rapids2018 @getdotmesh
  7. 7. Provenance & Reproducibility #rapids2018 @getdotmesh
  8. 8. “Provenance is the Missing Feature for Rigorous Data Science” – Joe Doliner, Co-Founder & CEO of Pachyderm #rapids2018 @getdotmesh
  9. 9. “The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms” – Hazy: Making it Easier to Build and Maintain Big-data Analytics #rapids2018 @getdotmesh
  10. 10. “The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms In A Fully Reproducible Way And Under Complete Provenance Steve Newhouse, RAPIDS 2018 #rapids2018 @getdotmesh
  11. 11. Provenance & Reproducibility ⇢ Provenance: “Origin of something…” ⇢ Track & Document the data source and the “models” ⇢ Who, What, Where, Why, When and How at every stage of ETL, EDA, ML, Reporting ⇢ Captures dependency between data sets: Enables reproducibility ⇢ Results can be traced back to their origins and recomputed from scratch: + Good for REPRODUCIBILITY + Good Practice (should be BEST PRACTICE) + Good for GDPR! #rapids2018 @getdotmesh
  12. 12. Why RAPIDS is important to me #rapids2018 @getdotmesh
  13. 13. HDR UK: A new national Institute for health data science ⇢ Structured and unstructured (e.g. imaging, text) data for derivation of new or deep phenotypes ⇢ Adding value at scale to existing world-leading cohorts in the UK ⇢ Demonstrating system-wide opportunities for research that improves quality of care ⇢ Enable large scale, high-throughput research that combines genomic data with electronic health records ⇢ Genomics, epigenomics, statistical and complex genetics, population genetics, cancer ‘omics’, molecular epidemiology Actionable Health Data Analytics Precision Medicine #rapids2018 @getdotmesh
  14. 14. HDR UK: A new national Institute for health data science ⇢ Transform Phase II – Phase IV clinical trials including ‘real world evidence’ studies ⇢ Towards prevention & early intervention ⇢ Ability to link health and administrative datasets across multiple environments ⇢ New technologies, from sensors to wearable devices to artificial intelligence 21st Century Trial Design Modernising Public Health Training Future Leaders in health data science #rapids2018 @getdotmesh
  15. 15. Large-scale machine learning & mixed ‘omic analysis strategies for patient care... #rapids2018 @getdotmesh
  16. 16. It’s All About The Data and #datasaveslives [Diagram: clinical data flowing between Investigator, Monitor, Laboratories, Data Manager, Analyst and Clinician to produce Results and Clinical Data] #rapids2018 @getdotmesh
  17. 17. How did I learn to stop worrying and love version control? #rapids2018 @getdotmesh
  18. 18. I made the transition from Lab tech to Bioinformatician #rapids2018 @getdotmesh
  19. 19. The things you discover... new ways of working that just made sense #rapids2018 @getdotmesh
  20. 20. And the crap we do/did that contributed to… The Reproducibility Crisis #rapids2018 @getdotmesh
  21. 21. My new world ⇢ Git, GitHub & READMEs... ⇢ R & Python Notebooks... ⇢ Meta-Data, Ontologies... ⇢ Shared Code… ⇢ Open Science, Open Data… ⇢ FAIR Principles… ⇢ Community! #rapids2018 @getdotmesh
  22. 22. My old world ⇢ I started in the lab...I was a Molecular Biologist/Geneticist ⇢ Provenance & Reproducibility == The Lab book, Publications… ⇢ Data stored on local Internal/External and University Drives/HPC #rapids2018 @getdotmesh
  23. 23. Lab books: RAPIDS the old way ⇢ Record Everything ⇢ Name, Version, Project, Date. ⇢ Materials & Methods.. ⇢ Signed off (sometimes) ⇢ Double Checked (sometimes) ⇢ Varying level of Detail: one liners to prose... We (Lab folks) are kind of doing it anyway...with lab books #rapids2018 @getdotmesh
  24. 24. How we (Basic Academia) often do Version Control ⇢ Formal Version Control & Data Provenance? Nope, not really ⇢ Documentation/Report: Excel, Word, Powerpoint & Images...cut and paste into lab book...heavily edited for publication... ⇢ Analysis: GUIs/SPSS, barplots in Excel...never record steps taken or software versions until publication... ⇢ Data: Local HDD, HPC, Dropbox...often only location is recorded… ⇢ Document & Data versions can often be overwritten or lost and then there is this... #rapids2018 @getdotmesh
  25. 25. #rapids2018 @getdotmesh
  26. 26. There is little to no formal Data/Model provenance & Version Control: The Story We Tell Raw Data → v1 → v2 → Final v3 → Publication 1, Publication 2, Publication 3 #rapids2018 @getdotmesh
  27. 27. There is little to no formal Data/Model provenance & Version Control: The Truth... Raw Data, my_data, v1.xx, v2, Final, xxx, zzz, Final Final, v2xy → Publication 1, Publication 2, Publication 3 #rapids2018 @getdotmesh
  28. 28. In Academia, a lot of folks* don't do Version Control & Provenance. If they do, it's haphazard and under duress** It is not standard operating procedure! *Academics/basic researchers: Statisticians, Economists, Bio/health-informaticians, Biologists and clinicians who can do R/Python/Stata/SAS/SPSS. **Extra work needed in keeping lab books, documenting everything, cleaning code, sharing code...not used to this way of working #rapids2018 @getdotmesh
  29. 29. Why? #rapids2018 @getdotmesh
  30. 30. Culture Lack of awareness Lack of education #rapids2018 @getdotmesh
  31. 31. It is not enforced in many labs Incentives are not aligned with RAPIDS Pressure to Publish Quickly #rapids2018 @getdotmesh
  32. 32. “[I was] completely unaware of robust solutions & common best practices from the Software/DevOps World… So were my supervisors! #rapids2018 @getdotmesh
  33. 33. The term Provenance was never mentioned Replication/Reproducibility were just catch-phrases Plus, this was all supposed to be captured in our Lab books… and then in the publication? Right?.... #rapids2018 @getdotmesh
  34. 34. #rapids2018 @getdotmesh
  35. 35. I may be a bit cynical but... #rapids2018 @getdotmesh
  36. 36. Some sobering reading The public are on to us – the ‘shoddy’ scientists ⇢ “Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis” / The Economist ⇢ How science goes wrong / The Economist ⇢ Trouble at the lab / The Economist ⇢ Is It Tough Love Time For Science? / The Big Think ⇢ Some of this is purely down to bad data management (& bad practices & lack of awareness & lack of education & so on and so forth…) #rapids2018 @getdotmesh
  37. 37. Garbage in, Garbage out (GIGO) Bad/NO Data Management/Experimental Design/Analysis Plan ⇢ Spurious results/False positives and negatives ⇢ Translational research suffers ⇢ The patients suffer ⇢ Lies are published ⇢ Time & Money wasted (Charity, Public, Private…) ⇢ There is no real progress ⇢ Serious Legal & Ethical implications: GDPR! #rapids2018 @getdotmesh
  38. 38. The Reproducibility Crisis: It's not just the fields of psychology & medicine... #rapids2018 @getdotmesh
  39. 39. Matthew Hutson said... “Artificial intelligence faces reproducibility crisis” Matthew Hutson, Science 2018 #rapids2018 @getdotmesh
  40. 40. AI faces a reproducibility crisis ⇢ “I think people outside the field might assume that because we have code, reproducibility is kind of guaranteed,” …. “Far from it.” ⇢ The most basic problem is that researchers often don’t share their source code (and their Data) ⇢ “The exact way that you run your experiments is full of undocumented assumptions and decisions,”....“A lot of this detail never makes it into papers.” ⇢ “No time to document every hyperparameter”... #rapids2018 @getdotmesh
  41. 41. Some common issues ⇢ Common misconception: Only CS/Software Devs need to do it ⇢ Lack of awareness from the Top-down & bottom-up: the lab lead/PI does not know about GIT and/or has not signed up to OPEN SCIENCE ⇢ Personalities/Culture/Environment: Why should I share? Data Hoarding… ⇢ Fear of being judged, Fear of the unknown, Fear of the command line ⇢ Laziness? - Adding extra steps to their workflow - “I have to do what now???” #rapids2018 @getdotmesh
  42. 42. There is a need for Reproducibility and Provenance in Data Science #rapids2018 @getdotmesh
  43. 43. There is a need for Reproducibility and Provenance in Everything #rapids2018 @getdotmesh
  44. 44. “One of the largest sources of error in [Data] Science results from computing [and publishing] results from different versions of the same data set. #rapids2018 @getdotmesh
  45. 45. And using different versions of the same software… #rapids2018 @getdotmesh
  46. 46. And using different versions/implementations of the “same” algorithm #rapids2018 @getdotmesh
  47. 47. And failing to capture & share all the steps taken when building your ML/AI model… Seed? Hyperparameters? Training Split? Features? Precision? Recall? Time of Day? #rapids2018 @getdotmesh
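To make the checklist on the slide above concrete: a minimal sketch of capturing seed, hyperparameters, train/test split, features, metrics and time of day as one record per training run. It assumes scikit-learn; the dataset, parameters and file name are illustrative, not from the talk.

# Illustrative sketch: record seed, hyperparameters, split, features,
# metrics and timestamp for every training run (assumes scikit-learn).
import json
from datetime import datetime, timezone

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

SEED = 42
PARAMS = {"n_estimators": 100, "max_depth": 5, "random_state": SEED}

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)

model = RandomForestClassifier(**PARAMS).fit(X_train, y_train)
pred = model.predict(X_test)

run_record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),  # "time of day"
    "seed": SEED,
    "hyperparameters": PARAMS,
    "train_test_split": {"test_size": 0.25, "n_train": len(X_train), "n_test": len(X_test)},
    "features": list(X.columns),
    "metrics": {
        "precision": float(precision_score(y_test, pred)),
        "recall": float(recall_score(y_test, pred)),
    },
}

# Append one line per run so the history is never overwritten.
with open("runs.jsonl", "a") as f:
    f.write(json.dumps(run_record) + "\n")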
  48. 48. And failing to capture the state of your Data at each iteration of the analysis... #rapids2018 @getdotmesh
  49. 49. The wider community is aware of this There are solutions We are getting better at it #rapids2018 @getdotmesh
  50. 50. Now over to Luke... (No, not that one) #rapids2018 @getdotmesh
  51. 51. Who am I? Luke Marsden Founder & CEO of dotmesh ⇢ Hacker & entrepreneur ⇢ Developed first storage system & volume plugin system for Docker ⇢ Kubernetes SIG lead ⇢ Formerly Computer Science @ Oxford #rapids2018 @getdotmesh
  52. 52. So you want to do reproducible data science/AI/ML? What do you need to pin down? #rapids2018 @getdotmesh
  53. 53. So you want to do reproducible data science/AI/ML? Environment #rapids2018 @getdotmesh
  54. 54. So you want to do reproducible data science/AI/ML? Environment Code (including parameters) #rapids2018 @getdotmesh
  55. 55. So you want to do reproducible data science/AI/ML? Environment Code (including parameters) Data #rapids2018 @getdotmesh
  56. 56. How? #rapids2018 @getdotmesh
  57. 57. Pinning down environment ⇢ In the DevOps world, Docker has been a big hit. ⇢ Docker helps you pin down the execution environment that your model training (or other data work) is happening in. ⇢ What is Docker? #rapids2018 @getdotmesh
  58. 58. What is docker? ⇢ Like tiny frozen, runnable copies of your computer's filesystem - e.g. Python libraries, Python versions ⇢ You can pin down the exact version of all the dependencies of your data science code ⇢ You can build, ship & run exactly the same thing anywhere… your laptop, a cluster, or the cloud ⇢ A Dockerfile lets you declare what versions of things you want; build a docker image from the Dockerfile and push it to a registry #rapids2018 @getdotmesh
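As a sketch of the kind of Dockerfile the slide describes: it pins the Python version and the library versions a notebook depends on, so the training environment can be rebuilt later. The image tag and package versions are illustrative examples, not recommendations from the talk.

# Illustrative Dockerfile: pin the interpreter and dependency versions.
FROM python:3.6.6-slim

# Pin exact dependency versions rather than "latest".
RUN pip install --no-cache-dir \
    numpy==1.15.1 \
    pandas==0.23.4 \
    scikit-learn==0.19.2 \
    jupyter==1.0.0

WORKDIR /workspace
COPY train.py /workspace/train.py

CMD ["python", "train.py"]

Building and pushing the resulting image is then a matter of docker build followed by docker push to a registry of your choice.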
  59. 59. Pinning down code ⇢ For decades developers have been version controlling their code. ⇢ Tools like git are very popular. #rapids2018 @getdotmesh
  60. 60. What is git? A version control system. Lets you track versions of your code and collaborate with others by commit, clone, push, pull… Problems: ⇢ git looks kinda scary - but it's worth persisting with ⇢ In data science, it's not natural to commit every time you change anything, e.g. while tuning parameters... ⇢ ...but you generate results while you're iterating #rapids2018 @getdotmesh
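One possible way to soften the "commit on every change" friction, sketched with the GitPython library (GitPython is an assumption here, not a tool named in the talk): automatically commit the script and its results file at the end of each experiment run, so every result is tied to an exact code version.

# Sketch only: auto-commit code and results at the end of each experiment.
# Assumes the GitPython package and that this runs inside a git repository.
from git import Repo


def commit_run(repo_path, paths, message):
    """Stage the given files and commit them if anything actually changed."""
    repo = Repo(repo_path)
    repo.index.add(paths)
    if repo.is_dirty(index=True, working_tree=False):
        return repo.index.commit(message).hexsha
    return None


# e.g. after a training run (file names are illustrative):
sha = commit_run(".", ["train.py", "runs.jsonl"], "run: filter_snps=150")
print("recorded run at commit", sha)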
  61. 61. Pinning down data ⇢ Method one: be very very organised (meticulous folder structure) + Never overwrite files… backup frequently… and get your whole team to do the same ⇢ Method two: use versioned S3 buckets #rapids2018 @getdotmesh
  62. 62. What is S3? A scalable object store on Amazon Web Services. Store lots of data quite cheaply. Version your objects (files) so that you can solve the problem of data changing "under your feet". Problems: ⇢ When you run an experiment, it's not natural to note down all the object versions ⇢ You generally care about the version of the whole bucket, not every single individual object (but S3 has no such notion) ⇢ You could build a system to track this, but you've got more important science to be doing... #rapids2018 @getdotmesh
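To make "versioned S3 buckets" concrete, a small sketch using boto3; the bucket and key names are placeholders. It enables versioning on a bucket, lists the versions of one object, and fetches a specific historical version so an analysis can be re-run against exactly the data it originally saw.

# Sketch only: working with S3 object versions via boto3.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-research-data"   # placeholder
KEY = "cohort/genotypes.csv"  # placeholder

# 1. Enable versioning on the bucket (a one-off step).
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# 2. List the recorded versions of one object.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["LastModified"], v["IsLatest"])

# 3. Fetch the exact version an earlier experiment used,
#    instead of whatever happens to be latest today.
version_id = versions["Versions"][-1]["VersionId"]  # e.g. the oldest listed
obj = s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=version_id)
data = obj["Body"].read()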
  63. 63. So you want to track provenance in data science/AI/ML? What do you need to pin down? #rapids2018 @getdotmesh
  64. 64. So you want to track provenance in data science/AI/ML? [Diagram: Data A and Data B are inputs to Code 1, which outputs Data C; Data C is in turn the input to Code 2, which outputs Model 1 and Model 2] #rapids2018 @getdotmesh
  65. 65. Pinning down data provenance If you can record the graph, you can point to any artefact/model and ask "show me exactly where this came from"... the exact version of the tool which generated it, what input data that tool used. And the transitive closure thereof. Possible tools: ⇢ CWL, Pachyderm Problems: ⇢ They require you to define the data pipeline up front ⇢ You don't always know the data pipeline up front ⇢ Often you're figuring it out as you go along, and it's evolving... #rapids2018 @getdotmesh
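As an illustration of "record the graph, then trace any artefact back" (a toy sketch, not how CWL, Pachyderm or dotscience store it): each artefact records the code version that produced it and its inputs, and tracing is a recursive walk back to the raw inputs. The artefact names mirror the diagram on slide 64; the commit hashes are made up.

# Toy provenance graph: trace any artefact back to its origins, transitively.
provenance = {
    "Data C":  {"produced_by": "Code 1 @ a1b2c3d", "inputs": ["Data A", "Data B"]},
    "Model 1": {"produced_by": "Code 2 @ e4f5a6b", "inputs": ["Data C"]},
    "Model 2": {"produced_by": "Code 2 @ e4f5a6b", "inputs": ["Data C"]},
}


def trace(artefact, depth=0):
    """Print everything an artefact was derived from, transitively."""
    record = provenance.get(artefact)
    if record is None:
        print("  " * depth + artefact + "  (raw input)")
        return
    print("  " * depth + artefact + "  <- " + record["produced_by"])
    for parent in record["inputs"]:
        trace(parent, depth + 1)


trace("Model 2")
# Model 2  <- Code 2 @ e4f5a6b
#   Data C  <- Code 1 @ a1b2c3d
#     Data A  (raw input)
#     Data B  (raw input)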
  66. 66. One more problem ⇢ Sad fact. People don't care about reproducibility as much as their day to day work (getting a paper published, shipping an optimised model to production, …) ⇢ Can we introduce reproducibility & provenance to people while also helping them get work done faster and more accurately? And collaborate better with their team? #rapids2018 @getdotmesh
  67. 67. The sweetener - track summary stats We asked dozens of data scientists to describe their workflows and their pain points. One problem stood out… ⇢ How do you track the progress/performance/results of your models? Your data science team? ⇢ Answers ranged from “in a google spreadsheet” to “in a text file”, “on a piece of paper” or even “verbally”! ⇢ Ideally, integrate summary stats tracking into a solution... #rapids2018 @getdotmesh
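As a minimal stop-gap along these lines (an illustrative sketch, not the dotscience approach): append each run's author, timestamp, parameters and summary stats to a shared CSV, roughly the shape of the run table shown on slide 72. The column names and file path are placeholders.

# Sketch only: log each run's parameters and summary stats to a shared CSV.
import csv
import getpass
import os
from datetime import datetime, timezone

LOG = "team_runs.csv"  # placeholder, e.g. a file on a shared drive
FIELDS = ["who", "when", "parameters", "error_rate"]


def log_run(parameters, error_rate, path=LOG):
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "who": getpass.getuser(),
            "when": datetime.now(timezone.utc).isoformat(),
            "parameters": "filter_snps=%d" % parameters["filter_snps"],
            "error_rate": error_rate,
        })


log_run({"filter_snps": 150}, 0.60)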
  68. 68. If only this was all a bit easier... #rapids2018 @getdotmesh
  69. 69. The reason we're running this event today #rapids2018 @getdotmesh
  70. 70. Introducing dotscience Solves reproducibility: ⇢ Tracks environment with Docker ⇢ Tracks data in versioned S3 buckets + dotmesh filesystem ⇢ Tracks code versions which generate summary stats in dotmesh + diff against git ⇢ Integrates with Jupyter (RStudio & scripts coming soon) [Diagram: Environment, Code, Data] #rapids2018 @getdotmesh
  71. 71. Introducing dotscience Solves provenance: ⇢ Builds the provenance graph on the fly ⇢ For any dataset, see what code generated it as the output of which other code, transitively ⇢ For any model, see exactly what code generated it, and what data that model was trained on [Diagram: Data C is the input to Code 2, which outputs Model 1 and Model 2] #rapids2018 @getdotmesh
  72. 72. Introducing dotscience Solves summary stats tracking: ⇢ Builds a table and chart of every run. Snapshots and keeps together: + versioned dataset + versioned model + all model parameters + compute environment ⇢ See the performance not just of your own work over time, but of your whole team. Example run table:
  Who     | When          | Parameters      | Error rate
  Alice   | 2 minutes ago | filter_snps=150 | 60%
  Bob     | 2 hours ago   | filter_snps=200 | 30%
  Charlie | 12 hours ago  | filter_snps=100 | 50%
  #rapids2018 @getdotmesh
  73. 73. Live demo time! #rapids2018 @getdotmesh
  74. 74. You can try this yourself this afternoon! #rapids2018 @getdotmesh
  75. 75. Roadmap for dotscience ⇢ Cloud Storage ⇢ R & RStudio, scripts, 'ds run' CLI support ⇢ Cluster support - Kubernetes ⇢ Spark/HDFS, MLlib ⇢ Slice & dice ⇢ Collaboration ⇢ Search and discovery ⇢ Multi-tenant execution, 1-click cluster installer, local installers #rapids2018 @getdotmesh
  76. 76. We need your help! #rapids2018 @getdotmesh
  77. 77. Thanks, questions? beta.dotscience.io slack.dotscience.io @lmarsden @s_j_newhouse #rapids2018 @getdotmesh
