
Applying GitOps and Progressive Delivery to Machine Learning

Machine Learning Ops (MLOps) is an area developing for the specific needs of machine learning. Especially when 6- to 7-figure dollar amounts and jobs can be at risk if an error occurs, you need a GitOps methodology and a way to leverage technologies such as service meshes. These will help you update and test models faster and more frequently. At the same time, making changes or rolling back in a heavy, services-based architecture can cause unintended effects throughout the rest of the system, so you need to control the blast radius of negative impacts and release new models incrementally. This is known as “progressive delivery,” which includes strategies such as canarying, A/B testing, and incremental blue-green deployments.

Paul Curtis, Principal Solutions Architect at Weaveworks, will cover GitOps, progressive delivery that leverages service meshes, and how they apply to teams with MLOps needs and concerns.

Benefits to the ecosystem:
Using the power of service meshes for machine learning ops is still fairly new, let alone applying GitOps and progressive delivery methodologies for reliability, lower risk, and control. We hope to bring useful tools and approaches to the right audiences looking to leverage service meshes to lower the risks of specific verticals.


Applying GitOps and Progressive Delivery to Machine Learning

Slide 1: GitOps Based Progressive Delivery in Kubernetes
Paul Curtis, Weaveworks
Slide 2: Everybody Good on Service Meshes?
● A combination of ingress controller and routing daemon
● Intelligent routing of network traffic
● Provides networking paths between pods and namespaces
● Load balancing, traffic switching
● Dynamic and programmable control plane
● Istio, linkerd, Envoy, App Mesh (EKS)
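The traffic-switching capability listed above can be sketched with an Istio VirtualService that splits requests between two versions of a model-serving service. The service and subset names (model-server, v1, v2) are illustrative assumptions, not from the talk:

```yaml
# Hypothetical Istio VirtualService: 90% of traffic to the current
# model (subset v1), 10% to the candidate (subset v2).
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-server
spec:
  hosts:
  - model-server
  http:
  - route:
    - destination:
        host: model-server
        subset: v1   # current model version
      weight: 90
    - destination:
        host: model-server
        subset: v2   # candidate model version
      weight: 10
```

The subsets would be defined in a matching DestinationRule keyed on pod labels; shifting traffic is then just a change to the two weight values.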
Slide 3: Everybody Good on Progressive Delivery?
● Controlled changes to production services, rather than the “Frankenstein Switch”
● Different methods depending on requirements
  ○ A/B (traffic routing, HTTP/Cookies)
  ○ Blue/Green (traffic switch)
  ○ Canary (progressive traffic shifting)
● Machine learning production models will most likely benefit from canary
● Most common tools: Argo (pre-prod) and Seldon (QA/prod)
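As a sketch of the A/B option, a service mesh can route on an HTTP cookie so that a chosen user group hits the new model while everyone else stays on the old one. The cookie name (`group=b`) and subset names here are assumptions for illustration:

```yaml
# Hypothetical A/B routing with Istio: requests carrying the cookie
# group=b go to model subset b; all other requests fall through to a.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-server
spec:
  hosts:
  - model-server
  http:
  - match:
    - headers:
        cookie:
          regex: "^(.*?;)?(group=b)(;.*)?$"
    route:
    - destination:
        host: model-server
        subset: b    # new model, test group only
  - route:
    - destination:
        host: model-server
        subset: a    # current model, everyone else
```

Unlike a weighted canary, this keeps each user pinned to one variant, which is what makes side-by-side comparison of model behavior possible.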
Slide 4: What adopting GitOps means
● Any developer can use Git
● Issues, reviews, pull requests: the same workflow
● Anyone can join the team and ship a new app or make changes
● All changes can be triggered, stored, validated and audited in Git
● Make ops changes by pull request, including rollbacks
● Observability and monitoring of operational changes
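In the Weaveworks tooling, the pull-request workflow above is typically wired up with Flux, an operator that watches a Git repository and applies the manifests it finds there. A minimal sketch, assuming the Flux v1 annotation convention and hypothetical deployment/image names:

```yaml
# Deployment manifest stored in Git. The Flux v1 annotations (sketch)
# ask the operator to automatically roll forward the "model" container
# to new images matching the semver range, committing the change back.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
  annotations:
    flux.weave.works/automated: "true"
    flux.weave.works/tag.model: semver:~1.0
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model
        image: example.com/model:1.0.0
```

Because the cluster state is whatever this file says, a rollback is simply `git revert` on the commit that changed it, and the audit trail is the Git history itself.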
Slide 5: ML Pipeline: Pre-Production (Incredibly Simplified)
(Diagram: Data Sets → Models → try different models, change and test different parameters → examine results)
Slide 6: GitOps Pre-Prod
Slide 7: WHY ARE WE TALKING ABOUT PRE-PRODUCTION?
Slide 8: Because If Everything is in Git ...
● Every iteration of testing is tracked, audited, and reproducible
● Using standard Kubernetes objects like Secrets and ConfigMaps allows changes to runs with minimal changes to code
● Once the model run is accepted, everything is now in Git
● Argo CD (which already does this), Seldon, and Kubeflow can all use this methodology
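A minimal sketch of the ConfigMap idea: the parameters for a model run live in Git as a manifest, so changing a run is a commit and a pull request rather than a code change. All names and values below are illustrative:

```yaml
# Hypothetical ConfigMap holding the parameters for one training run.
# Each tracked iteration is just a new commit to this file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-run-params
data:
  LEARNING_RATE: "0.001"
  BATCH_SIZE: "64"
  EPOCHS: "20"
```

The training pod can consume these as environment variables (e.g. via `envFrom` with a `configMapRef`), so the run that produced an accepted model is fully reproducible from the repository.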
Slide 9: WITH EVERYTHING IN GIT, WE ARE ALL SET UP FOR PROGRESSIVE DELIVERY
Slide 10: Why Use Canary Deployment?
Slide 11: GitOps Pipeline
Slide 12: How Flagger Works: Deployment

  canaryAnalysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary, percentage (0-100)
    maxWeight: 50
    # canary increment step, percentage (0-100)
    stepWeight: 5
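For context, a canaryAnalysis block like the one on this slide sits inside a Flagger Canary resource that points at the deployment to manage. This sketch assumes the v1alpha3 API (current when the field was named canaryAnalysis) and hypothetical target names:

```yaml
# Hypothetical Flagger Canary wrapping the analysis settings above.
apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: model-server
spec:
  # the deployment Flagger progressively shifts traffic to
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  service:
    port: 8080
  canaryAnalysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 5
```

On each interval Flagger raises the canary's traffic share by stepWeight until maxWeight is reached and the rollout is promoted, or the failure threshold trips and traffic is rolled back to the primary.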
Slide 13: How Flagger Works: Triggers and Metrics

  metrics:
  - name: request-success-rate
    # minimum request success rate (non-5xx responses), percentage (0-100)
    threshold: 99
    interval: 1m
  - name: request-duration
    # maximum request duration, P99, milliseconds
    threshold: 500
    interval: 30s
