Why is DevOps for Machine Learning so Different?
London DevOps, Oct ’19
Ryan Dawson
Outline
- Data Science vs Programming
- A Traditional Programming E2E Workflow
- Intro to ML E2E Workflows
- Detailed ML DevOps Topics
  - Training
  - Serving
  - Monitoring
- Advanced ML DevOps Challenges
- Review
DevOps Background
DevOps roles centred on CI/CD and infra
Established tools
Key enabler for projects - time to value & governance
MLOps Background
87% of ML projects never go live
ML-related infrastructure is complex
Rise of ‘MLOps’
Why So Different?
Running software performs actions in response to inputs.
Traditional programming codifies actions as explicit rules.
ML does not codify explicitly. Instead rules are indirectly set by capturing patterns from data.
Different problem domains - ML more applicable to focused numerical problems.
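To make the contrast concrete, here is a minimal sketch of the same decision made two ways: once as a hand-written rule and once as a threshold picked from labelled examples. The fraud scenario, the data and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical toy data: (order_value, was_fraud) pairs, invented for illustration.
EXAMPLES = [(10, False), (25, False), (40, False), (90, True), (120, True), (200, True)]


def rule_based_is_fraud(order_value):
    """Traditional programming: a developer writes the rule explicitly."""
    return order_value > 100  # threshold chosen by hand


def learn_threshold(examples):
    """ML-style: pick the threshold that makes the fewest errors on the data."""
    candidates = sorted(value for value, _ in examples)

    def errors(threshold):
        # Count examples where "value > threshold" disagrees with the label.
        return sum((value > threshold) != label for value, label in examples)

    return min(candidates, key=errors)


learned = learn_threshold(EXAMPLES)


def learned_is_fraud(order_value):
    """The 'rule' here comes from the data, not from the developer."""
    return order_value > learned
```

Changing the examples changes the learned threshold without any code change, which is the property the rest of the talk builds on.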
Examples
Traditional Programming
● Old terminal systems through to games
● Start with hello-world, then add control structures
Data Science
● Classification problems, regression problems
● Start with MNIST or Kaggle
ML Problem Examples
Regression:
- Predict salary from experience, education, location, etc.
- Predict sales from advertising spend, type of adverts, placement, etc.
Classification:
- Handwritten samples of numbers - which number is it?
- Image classification - cat or not cat?
Data Playgrounds/Exploration
Data science is exploratory
Interactive notebooks - great for exploration and visualization
ML code shared through notebooks - model can be an artifact
Regression fitting
Gradient Descent
Compute error against training data
Adjust weights and recompute
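The loop described above (compute error against training data, adjust weights, recompute) can be sketched in plain Python for a one-variable linear regression. The data, learning rate and step count are assumptions chosen so the toy example converges; real projects would normally use a library rather than hand-written loops.

```python
# Toy data (invented): years of experience -> salary in thousands.
# The points lie exactly on y = 5x + 30 so the result is easy to check.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [35.0, 40.0, 45.0, 50.0, 55.0]


def mean_squared_error(w, b):
    """Error of the line y = w*x + b against the training data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(learning_rate=0.05, steps=5000):
    """Batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Adjust the weights against the gradient, then recompute next round.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit()
print(f"w={w:.3f} b={b:.3f} mse={mean_squared_error(w, b):.6f}")
```

The fitted `w` and `b` are the "weights" that end up inside the model artifact.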
Key Points on ML
Training data and code together drive fitting
Closest thing to executable is a trained/weighted model (can vary with toolkit)
Retraining can be necessary (e.g. online shop and fashion trends)
Lots of data, long-running jobs
Traditional Programming Workflow
1. User Story
2. Write code
3. Submit PR
4. Tests run automatically
5. Review and merge
6. New version builds
7. Built executable deployed to environment
8. Further tests
9. Promote to next environment
10. More tests etc.
11. PROD
12. Monitor - stacktraces or error codes
Docker as packaging
Driver is a code change (git)
ML Workflows - Primer
Driver might be a code change. Or new data.
Data not in git.
More experimental - data-driven and you’ve only a sample of data.
Testing for quantifiable performance, not pass/fail.
Let’s focus on offline learning to simplify.
ML E2E Workflow Intro
1. Data inputs and outputs. Preprocessed. Large.
2. Try stuff locally with a slice.
3. Try with more data as long-running experiments.
4. Collaboration - often in Jupyter & git
5. Model may be pickled
6. Integrate into a running app e.g. add REST API (serving)
7. Integration test with app.
8. Monitor performance metrics
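Step 5 above can be illustrated with Python's standard `pickle` module. This is a minimal sketch in which the "model" is just a dictionary of learned parameters; the parameter names and the `predict` helper are hypothetical. ML toolkits typically pickle richer objects, or use their own formats.

```python
import pickle

# Hypothetical trained parameters, standing in for a real model object.
model = {"slope": 5.0, "intercept": 30.0}


def predict(params, x):
    """Apply the learned parameters to a new input."""
    return params["slope"] * x + params["intercept"]


# Serialise to bytes; these could be written to a file or uploaded to a bucket.
blob = pickle.dumps(model)

# What a serving process would do at start-up: load the artifact, then predict.
restored = pickle.loads(blob)
print(predict(restored, 4.0))
```

Unpickling can execute arbitrary code, so artifacts should only be loaded from trusted locations.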
Metrics Example
Online store example
A/B test
B leads to more conversions
But…
More negative reviews? Bounce-rate? Interaction-level? Latency?
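A sketch of why one metric is not enough: the numbers below are invented so that variant B wins on conversions but loses on bounce rate and latency. The field names and the choice of metrics are assumptions for illustration.

```python
# Hypothetical observations per variant, invented for illustration.
variants = {
    "A": {"visits": 1000, "conversions": 50, "bounces": 300,
          "latency_ms": [120, 130, 110]},
    "B": {"visits": 1000, "conversions": 65, "bounces": 420,
          "latency_ms": [180, 210, 190]},
}


def summarise(stats):
    """Turn raw counts into the rates a team would actually compare."""
    return {
        "conversion_rate": stats["conversions"] / stats["visits"],
        "bounce_rate": stats["bounces"] / stats["visits"],
        "mean_latency_ms": sum(stats["latency_ms"]) / len(stats["latency_ms"]),
    }


report = {name: summarise(stats) for name, stats in variants.items()}
for name, metrics in report.items():
    print(name, metrics)
```

Which of these metrics should decide the rollout is a business question, not something the code can answer.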
What Can Happen
Role of MLOps
Empower teams and break down silos
Provide ways to collaborate/self-serve
New Territory
Special challenges for ML.
No clear standards yet. We’ll drill into:
1. Training - slice of data, train a weighted model to make predictions on unseen data.
2. Serving - call with HTTP.
3. Rollout and Monitoring - making sure it performs.
1 Training/Experimentation
For long-running, intensive training jobs there are Kubeflow Pipelines, Polyaxon, MLflow…
Broken into steps incl. cleaning and transformation (pre-processing).
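The idea of a training job broken into steps can be sketched as plain functions chained together. This is not the API of Kubeflow Pipelines, Polyaxon or MLflow; the step names and data are hypothetical, and on a real platform each step would typically run as its own long-running job.

```python
def clean(rows):
    """Cleaning step: drop rows with missing values."""
    return [row for row in rows if None not in row]


def transform(rows):
    """Transformation step: scale the feature into the range [0, 1]."""
    xs = [x for x, _ in rows]
    lo, hi = min(xs), max(xs)
    return [((x - lo) / (hi - lo), y) for x, y in rows]


def train(rows):
    """Training step: closed-form least squares for a single feature."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def run_pipeline(raw_rows):
    """Run the steps in order, passing each step's output to the next."""
    artifact = raw_rows
    for step in (clean, transform, train):
        artifact = step(artifact)
    return artifact


raw = [(1.0, 10.0), (None, 5.0), (2.0, 20.0), (3.0, 30.0)]
model = run_pipeline(raw)
print(model)
```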
Model Training
Each step can be long-running
Kubeflow - an ML platform
Kubeflow Pipelines
Parameterised experiments
MLflow Experiments
Training and CI
Some training platforms have CI integration.
Result of a run could be a model. So analogous to a CI build of an executable.
But how to say that the new version is ‘good’?
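One possible answer, sketched under assumptions: score the new model and the current one on the same held-out data and only promote when the error improves by a chosen margin. The metric (mean absolute error), the margin and the function names are hypothetical choices, not a standard.

```python
def mean_absolute_error(predict, holdout):
    """Average absolute difference between predictions and known answers."""
    return sum(abs(predict(x) - y) for x, y in holdout) / len(holdout)


def is_promotable(candidate, baseline, holdout, min_improvement=0.0):
    """True if the candidate beats the baseline by more than min_improvement."""
    candidate_error = mean_absolute_error(candidate, holdout)
    baseline_error = mean_absolute_error(baseline, holdout)
    return (baseline_error - candidate_error) > min_improvement


# Hypothetical held-out data and two stand-in "models".
holdout = [(1.0, 35.0), (2.0, 40.0), (3.0, 45.0)]


def baseline(x):
    return 40.0  # always predicts the same value


def candidate(x):
    return 5.0 * x + 30.0


print(is_promotable(candidate, baseline, holdout, min_improvement=1.0))
```

Unlike a unit test, the gate is a comparison of scores, so the threshold itself becomes something the team has to agree on and revisit.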
2 Serving
Serving = use model via HTTP. Offline/batch is different.
Some platforms include serving, or there are dedicated solutions: Seldon, TensorFlow Serving, AzureML, SageMaker.
Often the model is packaged and hosted (e.g. in a bucket) so the serving solution can run it.
Serving can support rollout & monitoring.
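A minimal sketch of "use model via HTTP" with only the Python standard library. The JSON payload shape (`instances` in, `predictions` out), the model parameters and the port are assumptions for illustration; this is not the request format of Seldon or any other serving product.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model parameters; a real server would load a trained artifact.
MODEL = {"slope": 5.0, "intercept": 30.0}


def predict_from_json(body):
    """Parse a JSON request body, run the model, return a JSON response body."""
    payload = json.loads(body)
    predictions = [MODEL["slope"] * x + MODEL["intercept"]
                   for x in payload["instances"]]
    return json.dumps({"predictions": predictions}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests with predictions for the posted instances."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = predict_from_json(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def make_server(port=8000):
    """Build (but do not start) a local HTTP server for the model."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` would start it locally. Dedicated serving tools exist largely so that teams do not have to write and operate this wrapper themselves.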
Comparison: k8s hello world
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  selector:
    matchLabels:
      run: load-balancer-example
  replicas: 2
  template:
    metadata:
      labels:
        run: load-balancer-example
    spec:
      containers:
        - name: hello-world
          image: gcr.io/google-samples/node-hello:1.0
          ports:
            - containerPort: 8080
              protocol: TCP
K8s Dep using docker
Hand-craft Service spec
Seldon ML Serving
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: sklearn
spec:
  name: iris
  predictors:
  - graph:
      children: []
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
K8s custom resource
Pods created to serve http
Docker option too
Data scientists like pickles
3 Rollout and Monitoring
ML model trained on a sample - need to check and keep checking against new data coming in. Rollout strategies:
Canary = % of traffic to new version as check
A/B Test = % split between versions for longer to monitor performance
Shadowing = All traffic to old and new model. Only the live model’s responses are used
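The three strategies above differ mainly in how requests are routed. A minimal sketch, assuming hypothetical version names and weights: canary and A/B are both weighted splits (differing in purpose and duration), while shadowing sends every request to both models but returns only the live answer.

```python
import random


def pick_version(weights, rng):
    """Weighted routing for canary or A/B, e.g. weights {"old": 90, "new": 10}."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def serve_with_shadow(request, live_model, shadow_model, shadow_log):
    """Shadowing: both models see the request; only the live response is used."""
    shadow_log.append((request, shadow_model(request)))  # kept for comparison
    return live_model(request)


# Simulate a canary sending roughly 10% of traffic to the new version.
rng = random.Random(0)
counts = {"old": 0, "new": 0}
for _ in range(10_000):
    counts[pick_version({"old": 90, "new": 10}, rng)] += 1
print(counts)
```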
Canary with Seldon
kind: SeldonDeployment
apiVersion: machinelearning.seldon.io/v1alpha2
metadata:
  name: skiris
  namespace: default
  creationTimestamp:
spec:
  name: skiris
  predictors:
  - name: default
    graph:
      name: skiris-default
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
    replicas: 1
  - name: canary
    graph:
      name: skiris-canary
      implementation: XGBOOST_SERVER
      modelUri: gs://seldon-models/xgboost/iris
    replicas: 1
Traffic-splitting is more typically defined in gateway config.
Very common in ML.
Here it is defined in the serving layer, not the gateway, so the data scientist can define the rollout.
A/B Test with Seldon
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow-deployment
spec:
  name: mlflow-deployment
  predictors:
  - graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: gs://seldon-models/mlflow/elasticnet_wine
      name: wines-classifier
    name: a-mlflow-deployment-dag
    replicas: 1
    traffic: 20
  - graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: gs://seldon-models/mlflow/elasticnet_wine
      name: wines-classifier
    name: b-mlflow-deployment-dag
    replicas: 1
    traffic: 80
Seldon Metrics
Out of the box basic metrics (because so commonly needed)
Seldon Request Logging
Human review of predictions can be needed
Seldon UI
Rollout, serving and monitoring
Advanced Topics - Serving
● Real-time inference graphs with pre-processing
● Advanced routing - multi-armed bandits.
● Outlier detection
● Concept drift
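As an illustration of the multi-armed bandit routing mentioned above, here is a sketch of one common bandit strategy, epsilon-greedy. The class, the model names and the simulated success rates are hypothetical; this is not how any particular serving product implements its routers.

```python
import random


class EpsilonGreedyRouter:
    """Mostly routes to the best-performing model, sometimes explores others."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}
        self.rewards = {arm: 0.0 for arm in arms}

    def _mean_reward(self, arm):
        # Arms with no feedback yet are treated as having a mean reward of 0.
        return self.rewards[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))  # explore
        return max(self.pulls, key=self._mean_reward)  # exploit

    def update(self, arm, reward):
        """Record feedback (e.g. 1.0 for a conversion) for the chosen model."""
        self.pulls[arm] += 1
        self.rewards[arm] += reward


# Simulated feedback: model-b succeeds more often (rates invented).
TRUE_RATES = {"model-a": 0.2, "model-b": 0.8}
router = EpsilonGreedyRouter(TRUE_RATES, epsilon=0.1, seed=0)
reward_rng = random.Random(1)
for _ in range(2000):
    arm = router.choose()
    reward = 1.0 if reward_rng.random() < TRUE_RATES[arm] else 0.0
    router.update(arm, reward)
print(router.pulls)
```

Unlike a fixed A/B split, the traffic share shifts automatically as feedback arrives.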
Advanced Topics - Governance
● Explainability - why did it predict that?
○ Some orgs sticking to whitebox techniques - not neural nets
○ Blackbox is possible
● Reproducibility - tracking and metadata (associating models to training runs to data to triggers)
○ Data versioning adds complexity
○ Competing tools for metadata
○ No agreed standards yet
● Bias & ethics
● Adversarial attacks
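The reproducibility point above (associating models to training runs to data to triggers) can be sketched as a small metadata record keyed by content hashes. The field names and example values are hypothetical; as noted above, there is no agreed standard, and dedicated metadata tools record considerably more.

```python
import hashlib
import json


def sha256_of(data):
    """Content hash used as a stable identifier for data or model bytes."""
    return hashlib.sha256(data).hexdigest()


def build_run_record(training_data, model_bytes, code_version, trigger):
    """Link a model artifact to the data, code and trigger that produced it."""
    return {
        "data_sha256": sha256_of(training_data),
        "model_sha256": sha256_of(model_bytes),
        "code_version": code_version,  # e.g. a git commit id
        "trigger": trigger,            # e.g. "code-change" or "new-data"
    }


# Placeholder inputs, invented for illustration.
record = build_run_record(
    training_data=b"x,y\n1,35\n2,40\n",
    model_bytes=b"placeholder-model-bytes",
    code_version="abc1234",
    trigger="new-data",
)
print(json.dumps(record, indent=2))
```

Hashing whole datasets is where data versioning adds complexity: large or changing data makes even this simple record expensive to produce.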
Summary
MLOps is new terrain.
ML is data-driven. MLOps enables with:
● Data and compute-intensive experiments and training
● Artifact tracking
● Monitoring tools
● Rollout strategies to work with monitoring

Why is DevOps for Machine Learning so Different?

  • 1. Why is DevOps for Machine Learning so Different? London DevOps, Oct ‘19. Ryan Dawson.
  • 2. Outline: Data Science vs Programming; A Traditional Programming E2E Workflow; Intro to ML E2E Workflows; Detailed ML DevOps Topics (Training, Serving, Monitoring); Advanced ML DevOps Challenges; Review.
  • 3. DevOps Background: DevOps roles centred on CI/CD and infra. Established tools. Key enabler for projects - time to value & governance.
  • 4. MLOps Background: 87% of ML projects never go live. ML-related infrastructure is complex. Rise of ‘MLOps’.
  • 5. Why So Different? Running software performs actions in response to inputs. Traditional programming codifies actions as explicit rules. ML does not codify explicitly. Instead rules are indirectly set by capturing patterns from data. Different problem domains - ML more applicable to focused numerical problems.
  • 6. Examples
      Traditional Programming
      ● Old terminal systems through to games
      ● Start with hello-world, add control structures
      Data Science
      ● Classification problems, regression problems
      ● Start with MNIST or Kaggle
  • 7. ML Problem Examples
      Regression:
      - Predict salary from experience, education, location, etc.
      - Predict sales from advertising spend, type of adverts, placement, etc.
      Classification:
      - Handwriting samples for numbers - which number is it?
      - Image classification - cat or not cat?
  • 8. Data Playgrounds/Exploration: Data science is exploratory. Interactive notebooks - great for exploration and visualization. ML code shared through notebooks - model can be an artifact.
  • 10. Gradient Descent: Compute error against training data. Adjust weights and recompute.
  • 11. Key Points on ML: Training data and code together drive fitting. Closest thing to executable is a trained/weighted model (can vary with toolkit). Retraining can be necessary (e.g. online shop and fashion trends). Lots of data, long-running jobs.
  • 12. Traditional Programming Workflow
      1. User Story
      2. Write code
      3. Submit PR
      4. Tests run automatically
      5. Review and merge
      6. New version builds
      7. Built executable deployed to environment
      8. Further tests
      9. Promote to next environment
      10. More tests etc.
      11. PROD
      12. Monitor - stacktraces or error codes
      Docker as packaging. Driver is a code change (git).
  • 13. ML Workflows - Primer: Driver might be a code change. Or new data. Data not in git. More experimental - data-driven and you’ve only a sample of data. Testing for quantifiable performance, not pass/fail. Let’s focus on offline learning to simplify.
  • 14. ML E2E Workflow Intro
      1. Data inputs and outputs. Preprocessed. Large.
      2. Try stuff locally with a slice.
      3. Try with more data as long-running experiments.
      4. Collaboration - often in Jupyter & git.
      5. Model may be pickled.
      6. Integrate into a running app, e.g. add REST API (serving).
      7. Integration test with app.
      8. Monitor performance metrics.
  • 15. Metrics Example: Online store example. A/B test. B leads to more conversions. But… More negative reviews? Bounce-rate? Interaction-level? Latency?
  • 17. Role of MLOps: Empower teams and break down silos. Provide ways to collaborate/self-serve.
  • 18. New Territory: Special challenges for ML. No clear standards yet. We’ll drill into: 1. Training - slice of data, train a weighted model to make predictions on unseen data. 2. Serving - call with HTTP. 3. Rollout and Monitoring - making sure it performs.
  • 19. 1. Training/Experimentation: For long-running, intensive training jobs there’s Kubeflow Pipelines, Polyaxon, MLflow… Broken into steps incl. cleaning and transformation (pre-processing).
  • 20. Model Training: Each step can be long-running.
  • 21. Kubeflow - an ML platform
  • 24. Training and CI: Some training platforms have CI integration. Result of a run could be a model. So analogous to a CI build of an executable. But how to say that the new version is ‘good’?
  • 25. 2. Serving: Serving = use model via HTTP. Offline/batch is different. Some platforms have serving or there are dedicated solutions: Seldon, TensorFlow Serving, AzureML, SageMaker. Often package the model and host (bucket) so the serving solution can run it. Serving can support rollout & monitoring.
  • 26. Comparison: k8s hello world
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: hello-world
      spec:
        selector:
          matchLabels:
            run: load-balancer-example
        replicas: 2
        template:
          metadata:
            labels:
              run: load-balancer-example
          spec:
            containers:
              - name: hello-world
                image: gcr.io/google-samples/node-hello:1.0
                ports:
                  - containerPort: 8080
                    protocol: TCP

      K8s Deployment using Docker. Hand-craft Service spec.
  • 27. Seldon ML Serving
      apiVersion: machinelearning.seldon.io/v1alpha2
      kind: SeldonDeployment
      metadata:
        name: sklearn
      spec:
        name: iris
        predictors:
          - graph:
              children: []
              implementation: SKLEARN_SERVER
              modelUri: gs://seldon-models/sklearn/iris
              name: classifier
            name: default
            replicas: 1

      K8s custom resource. Pods created to serve HTTP. Docker option too. Data scientists like pickles.
  • 28. 3. Rollout and Monitoring: ML model trained on a sample - need to check and keep checking against new data coming in. Rollout strategies:
      Canary = % of traffic to new version as check.
      A/B Test = % split between versions for longer to monitor performance.
      Shadowing = all traffic to old and new model. Only the live model’s responses are used.
  • 29. Canary with Seldon
      kind: SeldonDeployment
      apiVersion: machinelearning.seldon.io/v1alpha2
      metadata:
        name: skiris
        namespace: default
        creationTimestamp:
      spec:
        name: skiris
        predictors:
          - name: default
            graph:
              name: skiris-default
              implementation: SKLEARN_SERVER
              modelUri: gs://seldon-models/sklearn/iris
            replicas: 1
          - name: canary
            graph:
              name: skiris-canary
              implementation: XGBOOST_SERVER
              modelUri: gs://seldon-models/xgboost/iris
            replicas: 1

      Traffic-splitting more typically defined in gateway config. Very common in ML. Defined in serving, not gateway, so data scientist can define rollout.
  • 30. A/B Test with Seldon
      apiVersion: machinelearning.seldon.io/v1alpha2
      kind: SeldonDeployment
      metadata:
        name: mlflow-deployment
      spec:
        name: mlflow-deployment
        predictors:
          - graph:
              children: []
              implementation: MLFLOW_SERVER
              modelUri: gs://seldon-models/mlflow/elasticnet_wine
              name: wines-classifier
            name: a-mlflow-deployment-dag
            replicas: 1
            traffic: 20
          - graph:
              children: []
              implementation: MLFLOW_SERVER
              modelUri: gs://seldon-models/mlflow/elasticnet_wine
              name: wines-classifier
            name: b-mlflow-deployment-dag
            replicas: 1
            traffic: 80
  • 31. Seldon Metrics: Out of the box basic metrics (because so commonly needed).
  • 32. Seldon Request Logging: Human review of predictions can be needed.
  • 33. Seldon UI: Rollout, serving and monitoring.
  • 34. Advanced Topics - Serving
      ● Real-time inference graphs with pre-processing
      ● Advanced routing - multi-armed bandits
      ● Outlier detection
      ● Concept drift
  • 35. Advanced Topics - Governance
      ● Explainability - why did it predict that?
          ○ Some orgs sticking to whitebox techniques - not neural nets
          ○ Blackbox is possible
      ● Reproducibility - tracking and metadata (associating models to training runs to data to triggers)
          ○ Data versioning adds complexity
          ○ Competing tools for metadata
          ○ No agreed standards yet
      ● Bias & ethics
      ● Adversarial attacks
  • 36. Summary: MLOps is new terrain. ML is data-driven. MLOps enables with:
      ● Data and compute-intensive experiments and training
      ● Artifact tracking
      ● Monitoring tools
      ● Rollout strategies to work with monitoring
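
The rollout strategies from slides 28 to 30 can be sketched in a few lines of Python. This is only an illustration of the idea, not how Seldon routes requests: the function names (pick_predictor, serve_split, serve_shadow) and the stand-in models are hypothetical, and only the 20/80 traffic weights are taken from the A/B slide.

```python
import random
from typing import Callable, Dict, List, Tuple

# A "model" here is any callable from a feature vector to a prediction (assumption).
Model = Callable[[List[float]], float]


def pick_predictor(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick a predictor name with probability proportional to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def serve_split(
    models: Dict[str, Model],
    weights: Dict[str, int],
    features: List[float],
    rng: random.Random,
) -> Tuple[str, float]:
    """Canary or A/B test: one predictor answers each request, chosen by weight."""
    name = pick_predictor(weights, rng)
    return name, models[name](features)


def serve_shadow(
    live: Model,
    shadow: Model,
    features: List[float],
    shadow_log: List[float],
) -> float:
    """Shadowing: both models see the request, only the live response is returned."""
    shadow_log.append(shadow(features))  # kept for later comparison
    return live(features)


if __name__ == "__main__":
    # Hypothetical stand-in models; real ones would be loaded from storage.
    models = {"a": lambda x: sum(x), "b": lambda x: sum(x) * 1.1}
    rng = random.Random(0)
    print(serve_split(models, {"a": 20, "b": 80}, [1.0, 2.0], rng))
```

A canary and an A/B test use the same mechanism here; the difference is the size of the split and how long it runs before a decision is made.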

Editor's Notes

  1. Risk that data coming in will diverge significantly from the sample taken for training. This is concept drift.
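
One simple way to watch for the divergence described in this note is to compare a feature's live values against the training sample. The sketch below is an assumption-laden illustration, not a method from the talk: the function names and the 0.5 threshold are hypothetical, and it only checks a shift in the mean of a single numeric feature.

```python
import statistics
from typing import Sequence


def mean_shift_score(training: Sequence[float], live: Sequence[float]) -> float:
    """Absolute difference in means, in units of the training standard deviation."""
    train_std = statistics.stdev(training)  # needs at least two training values
    if train_std == 0:
        raise ValueError("training sample has zero variance")
    return abs(statistics.mean(live) - statistics.mean(training)) / train_std


def has_drifted(
    training: Sequence[float],
    live: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off; tune per feature
) -> bool:
    """Flag the feature when the live mean moves more than `threshold` std devs."""
    return mean_shift_score(training, live) > threshold


if __name__ == "__main__":
    training_sample = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(has_drifted(training_sample, [2.0, 3.0, 4.0]))
    print(has_drifted(training_sample, [8.0, 9.0, 10.0]))
```

A check like this would typically run on request logs at a regular interval, with a flagged feature prompting human review or retraining.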