Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

CD4ML - ThoughtWorks MeetUp Munich Christoph Windheuser May 8th 2019

185 views

Published on

These are the slides of Christoph Windheuser at the MeetUp at ThoughtWorks in Munich on May 8th, 2019. Christoph spoke about how to build up a Continuous Delivery (CD) framework for Machine Learning and Data Science applications in the industry.

Published in: Software
  • Be the first to comment

  • Be the first to like this

CD4ML - ThoughtWorks MeetUp Munich Christoph Windheuser May 8th 2019

  1. 1. 1 Continuous Intelligence Applying Continuous Delivery for Machine Learning Christoph Windheuser Global Head of Artificial Intelligence ThoughtWorks Inc. Munich, May 8, 2019 ©ThoughtWorks 2019
  2. 2. 5000+ technologists with 40 offices in 14 countries Partner for technology driven business transformation ©ThoughtWorks 2019 join.thoughtworks.com
  3. 3. #1 in Agile and Continuous Delivery 100+ books written ©ThoughtWorks 2019
  4. 4. ©ThoughtWorks 2019
  5. 5. TECHNIQUES Continuous delivery for machine learning (CD4ML) models #8 TRIAL 8 ©ThoughtWorks 2019
  6. 6. 6 CONTINUOUS INTELLIGENCE ©ThoughtWorks 2019
  7. 7. ©ThoughtWorks 2018 Commercial in Confidence CONTINUOUS INTELLIGENCE CYCLE ©ThoughtWorks 2019 7
  8. 8. 8 PRODUCTIONIZING ML IS HARD ● High number of changing artifacts ● Size and portability of the artifacts ● Different skills and working processes in the workforce with “throw over the fence” attitude ● Serial and parallel processing ● Models must be continuously monitored and improved ©ThoughtWorks 2019
  9. 9. DIFFERENT ARCHETYPES: MANY SOURCES OF CHANGE 9 ModelData Code + + Schema Sampling over Time Volume ... Research, Experiments Training on Data Accuracy ... New Features Bug Fixes Performance ... Icons created by Noura Mbarki and I Putu Kharismayadi from Noun Project ©ThoughtWorks 2019
  10. 10. 10 CONTINUOUS INTEGRATION / CONTINUOUS DELIVERY ©ThoughtWorks 2019
  11. 11. 11 CI CD CI / CD ©ThoughtWorks 2019
  12. 12. “Continuous Delivery is the ability to get changes of all types — including new features, configuration changes, bug fixes and experiments — into production, or into the hands of users, safely and quickly in a sustainable way.” - Jez Humble & Dave Farley 12 ©ThoughtWorks 2019
  13. 13. ©ThoughtWorks 2019 PRINCIPLES OF CONTINUOUS DELIVERY 13 → Create a Repeatable, Reliable Process for Releasing Software → Automate Almost Everything → Build Quality In → Work in Small Batches → Keep Everything in Source Control → Done Means “Released” → Improve Continuously
  14. 14. 14 CD4ML ©ThoughtWorks 2019
  15. 15. PUTTING EVERYTHING TOGETHER 15 Data Science, Model Building Training Data Source Code + Executables Model Evaluation Productionize Model Integration Testing Deployment Test Data Model + parameters CD Tools and Repositories DiscoverableandAccessibleData Monitoring ©ThoughtWorks 2019 Production Data
  16. 16. WHAT DO WE NEED IN OUR STACK? 16 Doing CD with Machine Learning is still a hard problem MODEL PERFORMANCE ASSESSMENT VERSION CONTROL AND ARTIFACT REPOSITORIES ©ThoughtWorks 2019 MONITORING AND OBSERVABILITY DISCOVERABLE AND ACCESSIBLE DATA CONTINUOUS DELIVERY ORCHESTRATION TO COMBINE PIPELINES INFRASTRUCTURE FOR MULTIPLE ENVIRONMENTS AND EXPERIMENTS
  17. 17. REAL WORLD EXAMPLE 17 There are many options for tools and technologies to implement CD4ML ©ThoughtWorks 2019
  18. 18. MACHINE LEARNING PIPELINE 18 ©ThoughtWorks 2019 18
  19. 19. BASIC DATA SCIENCE WORKFLOW 19 Gather data and extract features Separate into training and validation sets Train model and evaluate performance ©ThoughtWorks 2019
  20. 20. SALES FORECAST MODEL TRAINING PROCESS 20 Data splitter.p y Training Data Validation Data decision_t ree.py model.pkl evaluation.py metrics.json download_d ata.py ©ThoughtWorks 2019
  21. 21. CHALLENGE 1: THESE ARE LARGE FILES! 21 Data splitter.p y Training Data Validation Data decision_t ree.py model.pkl evaluation.py metrics.json download_d ata.py ©ThoughtWorks 2019
  22. 22. CHALLENGE 2: AD-HOC MULTI-STEP PROCESS 22 Data splitter.p y Training Data Validation Data decision_t ree.py model.pkl evaluation.py metrics.json download_d ata.py ©ThoughtWorks 2019
  23. 23. ● dvc is git porcelain for storing large files using cloud storage ● dvc connects model training steps to create reproducible workflows SOLUTION: dvc data science version control 23 master change-max-depth try-random-forest model.pkl decision_tree.py model.pkl.dvc ©ThoughtWorks 2019
  24. 24. HOW DO WE TRACK EXPERIMENTS? ● Which experiments and hypothesis are being explored? ● Which algorithms are being used in each experiment? ● Which version of the code was used? ● How long does it take to run each experiment? ● What parameter and hyperparameters were used? ● How fast are my models learning? ● How do we compare results from different runs? We need to track the scientific process and evaluate our models: 24 ©ThoughtWorks 2019 24
  25. 25. 25 ©ThoughtWorks 2019 An Open Source platform for managing end-to-end machine learning lifecycle
  26. 26. DEPLOYMENT PIPELINE 26 ©ThoughtWorks 2019 26
  27. 27. DEPLOYMENT PIPELINE Automates the process of building, testing, and deploying applications to production 27 Application code in version control repository Container image as deployment artifact Deploy container to production servers ©ThoughtWorks 2019
  28. 28. 28 ©ThoughtWorks 2019 An Open Source Continuous Delivery server to model and visualise complex workflows
  29. 29. Pipeline Group ANATOMY OF A GOCD PIPELINE 29 ©ThoughtWorks 2019
  30. 30. MODEL MONITORING 30 ©ThoughtWorks 2019 30
  31. 31. HOW TO LEARN CONTINUOUSLY? ● Track model usage ● Track model inputs ● Track model outputs to identify potential bias or overfit ● Track model fairness to understand how it behaves against dimensions that could introduce unfair bias We need to capture production data to improve our models 31 ©ThoughtWorks 2019 31
  32. 32. EFK STACK Monitoring and Observability infrastructure 32 Open Source data collector for unified logging Open Source Search Engine Open Source web UI to explore and visualise data ©ThoughtWorks 2019
  33. 33. 33 ©ThoughtWorks 2019 An Open Source UI that makes it easy to explore and visualise the data index in Elasticsearch
  34. 34. 34 REAL WORLD PROJECT EXAMPLES ©ThoughtWorks 2019
  35. 35. 35 PRICE ESTIMATION ©ThoughtWorks 2019
  36. 36. 36 Chatbot Platform ©ThoughtWorks 2019
  37. 37. 37 Workshop on the Strata Conference in London (April 30, 2019) ©ThoughtWorks 2019 • 3h hands-on workshop • Run the a real-world scenario on participants laptop • Cloud Infrastructure (GCP, Kubernetes, GoCD Server) was prepared
  38. 38. 38 SUMMARY - WHAT HAVE WE LEARNED? ©ThoughtWorks 2019
  39. 39. CD4ML ● Proper data/model/code versioning tools enable reproducible work to be done in parallel. ● We can put data science work into a Continuous Delivery (CD) workflow. ● Result: Continuous, on-demand AI development and deployment, from research to production, with a single command. ● Benefit: production AI systems that are always as smart as your data science team. 39 ©ThoughtWorks 2019
  40. 40. 4040 THANK YOU! Christoph Windheuser Global Head of Artificial Intelligence ThoughtWorks Inc. (cwindheu@thoughtworks.com) ©ThoughtWorks 2019 join.thoughtworks.com

×