SlideShare a Scribd company logo
1 of 38
Download to read offline
Cloud-Native MLOps
Framework
Data Fest 2021
Artem Koval
Big Data and Machine Learning Practice Lead at ClearScale
About Speaker
● Hey all!
● Name: Artem Koval
● Position: Big Data and Machine Learning
Practice Lead
● Company: ClearScale
Agenda
● What is modern MLOps
● Why the shift towards Human-Centered AI
● Fairness, Explainability, Model Monitoring
● Human Augmented AI
● How much MLOps do you need in your organization
● The future
What is MLOps?
● https://en.wikipedia.org/wiki/MLOps
● https://ml-ops.org/
● A process of deploying ML models in CI/CD manner into production,
establishing model monitoring, explainability, fairness, and providing tools for
human intervention
CRISP-DM
● https://en.wikipedia.org
/wiki/Cross-industry_st
andard_process_for_d
ata_mining
● Too generic
Why a Framework
● A need for the end-to-end solution, from the data ingestion to the model
monitoring, data labeling, algorithms explainability
An elegant weapon for a more civilized age (c)
● Your father’s ML Pipeline
ML has Technical Debt?
● Hidden Debt in Machine Learning Systems
The House of MLOps
Human-Centered AI
● https://hai.stanford.edu/
● https://plato.stanford.edu/entries/ethics-ai/
● https://ethical.institute/
● Humans must control AI end-to-end solutions
Cloud-Native MLOps
Modern MLOps Framework Drivers
● Not only CI/CD and ML code anymore
● Fairness and Explainability
● Observability (Monitoring)
● Scalability (Training and Inference)
● Data Labeling
● A/B Testing, Acceptance Testing
● Human Review
● Legacy Migration
● Multi-tenant Multi-model
Fairness
● https://github.com/slundberg/shap
● Regulatory requirements
● Business trust
Explainability
● https://github.com/Trusted-AI/AIF360
● No bias in data, no bias in inference (gender, racial, religious, ageism etc.)
● Fairness and Explainability by Design as a Process
Monitor Data Quality
● Monitors ML models in production and notifies when data quality issue arise
● Enable data capture (inference input & output, historical data)
● Create a baseline (https://github.com/awslabs/deequ)
● Define and schedule data quality monitoring jobs
● View data quality metrics/violations
● Integrate data quality monitoring with a Notification Service
● Interpret the results of a monitoring job
● Visualize results
Data Quality Violations/Metrics
● data_type_check
● completeness_check
● baseline_drift_check
● missing_column_check
● extra_column_check
● categorical_values_check
● Max, Min, Sum, SampleCount, Average, Distribution, StdDev, Mean
● ...
Monitor Model Quality
● Monitors the performance of a model by comparing the live predictions with
the actual ground truth labels
● Enable Data Capture
● Create a baseline
● Define and schedule model quality monitoring jobs
● Ingest ground truth labels that model monitor merges with captured prediction
data from real-time/batch inference endpoints
● Integrate model quality monitoring with a Notification Service
● Interpret the results of a monitoring job
● Visualize the results
Model Quality Metrics
● Regression: mae, mse, rmse, r2, ...
● Binary classification: confusion matrix, recall, precision, accuracy,
recall_best_constant_classifier, precision_best_constant_classifier,
accuracy_best_constant_classifier, true_positive_rate, …
● Multiclass classification: weighted_recall, weighted_f1,
weighted_f2_best_constant_classifier, ...
Monitor Bias Drift
● https://github.com/aws/amazon-sagemaker-clarify
● https://github.com/anodot/MLWatcher
● Training data differs from the live inference data
● Pre-training/post-training/common
Bias Metrics
● Class Imbalance (CI)
● Difference in Positive Proportions in Labels (DPL)
● Kullback-Liebler Divergence (KL)
● Jensen-Shannon Divergence (JS)
● Total Variation Distance (TVD)
● Kolmogorov-Smirnov Distance (KS)
● Conditional Demographic Disparity in Labels (CDDL)
● Difference in Conditional Outcomes (DCO)
● Difference in Label Rates (DLR)
● ...
Monitor Feature Attribution Drift
● https://github.com/slundberg/shap
● A drift in the distribution of live data for models in production can result in a
corresponding drift in the feature attribution values
Feature Attribution Drift Monitoring Methods
● LIME
● Shapley sampling values
● DeepLIFT
● QII
● Layer-wise relevance propagation
● Shapley regression values
● Tree interpreter
Human Augmented AI Drivers
● Need human oversight to ensure accuracy with sensitive data (healthcare,
finance)
● Implement human review of ML predictions
● Integrate human oversight with any application
● Flexibility to work with inside and outside reviewers
● Easy instructions for reviewers
● Workflows to simplify the human review process
● Improve results with multiple reviews
Human Augmented AI
Human Augmented Ground Truth Labeling Drivers
● Improve data label accuracy
● Easy to use (automatic snapping, image denoising, pre-selecting object
contour, etc.)
● Reduce costs
● Distribute workload over varying workforce
Human Augmented Ground Truth Labeling
MLOps Levels
● Lightweight MLOps
● Cloud-Native Greenfield SMB MLOps
● Enterprise MLOps
● Human-Centered AI
Lightweight MLOps Capacity
● One-person data science shop
● Small number of models (1-3)
● ML system is a greenfield
● Need to run time-critical demo for a small audience
● Models are custom, lightweight, don’t require compute-intensive model
training/HPO
● Low traffic is expected
Lightweight MLOps Solution Blueprint
● Convert models with TensorFlow Lite (or other framework-specific strip-down)
● Write a simple API microservice (e.g., Flask)
● Deploy as is, with no containerization, as a web app calling ML layer
● Use CPU-based commodity cloud instances
● Minimal model monitoring to at least capture drift
● Data analysis, feature engineering, orchestration, CI/CD, acceptance/AB
testing might be omitted
● Bootstrapping is highly needed to organize the process (e.g. Metaflow)
Cloud-Native SMB MLOps Capacity
● Have engineering resource
● Custom proprietary algorithms
● ML system is a greenfield
● Model development requires advance comput for training/HPO and inference
● Multi-model, multi tenant setup is needed
Cloud-Native SMB MLOps Solution Blueprint
● Containerize models (Docker)
● Utilize framework/cloud vendor specific HPO approaches
● Use GPU-based commodity cloud instances when needed
● Use cloud vendor specific elastic inference approaches
● Abstract and isolate data analysis, feature engineering, model training and
other steps
● Orchestrate with Apache Airflow or similar technology agnostic tools
● Ensure multi-tenancy by logical isolation of ML Workflows
● Implement model monitoring at least partially (bias drift, model quality)
Enterprise MLOps Capacity
● Have legacy ML system with a lot of microservices, models, orchestration
flows
● Have highly custom proprietary libraries requiring complex make
● Have advanced tenant isolation requirements
● Have a lot of models (>10)
● Have advanced needs for a large data science team to collaborate
Enterprise MLOps Blueprint
● Serve dockerized models with Kubeflow in a Kubernetes cluster
● Use Kubeflow tenancy isolation
● Use KFServing to deploy multiple variants of multiple models
● Use Katib for HPO
● Use Prometheus + Grafana, ELK for the full model monitoring, consuming
metrics with the open-source empowered microservices (SHAP, etc.)
● Implement advanced production acceptance testing (e.g., Differential Testing,
Shadow Deployments, Integration Testing etc.)
● Built custom human augmented Review/Labeling tools
Human-Centered AI Blueprint
● Can be added at any size/project configuration
● Ideally should be incorporated as a process touch all steps (data analysis,
training, deployment, monitoring)
● Remember: the moment your model is deployed to production it’s already
obsolete. Build with the CI/CD and human operations review in mind
The future
● Privacy-Preserving Machine Learning (differential, compressive, etc.)
● Models interpretability (global, local, saliency mapping, semantic similarity
etc.)
● Model Monitoring in AutoML (AutoKeras/Keras Tuner + SHAP, etc.)
● Measuring human augmentation (uncertainty/diversity sampling, active
learning, quality control, annotation/augmentation quality metrics, etc.)
Thanks everyone!
Questions?
The End
You could reach me via mail@artemkoval.com or LinkedIn
May the MLOps be with you!
Extra Resources
● https://github.com/EthicalML/awesome-production-machine-learning
● https://aws.amazon.com/solutions/implementations/aws-mlops-framework/
● https://cloud.google.com/architecture/mlops-continuous-delivery-and-automati
on-pipelines-in-machine-learning
● https://azure.microsoft.com/en-us/services/machine-learning/mlops/
● https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
● https://github.com/visenger/awesome-mlops#mlops-books

More Related Content

Similar to Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf

Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...DataScienceConferenc1
 
Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprisedoppenhe
 
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...Sotrender
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLMárton Kodok
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)Senturus
 
“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOpsRui Quintino
 
MLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxMLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxKnoldus Inc.
 
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...DataScienceConferenc1
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro sessionAvinash Patil
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorchgeetachauhan
 
Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Tushar Katarki
 

Similar to Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf (20)

Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
 
Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
 
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQL
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
 
“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps
 
MLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxMLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptx
 
Aws autopilot
Aws autopilotAws autopilot
Aws autopilot
 
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
 
Machine learning
Machine learningMachine learning
Machine learning
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
 
DevOps Days Rockies MLOps
DevOps Days Rockies MLOpsDevOps Days Rockies MLOps
DevOps Days Rockies MLOps
 
Ml ops on AWS
Ml ops on AWSMl ops on AWS
Ml ops on AWS
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro session
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorch
 
Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf

  • 1. Cloud-Native MLOps Framework Data Fest 2021 Artem Koval Big Data and Machine Learning Practice Lead at ClearScale
  • 2. About Speaker ● Hey all! ● Name: Artem Koval ● Position: Big Data and Machine Learning Practice Lead ● Company: ClearScale
  • 3. Agenda ● What is modern MLOps ● Why the shift towards Human-Centered AI ● Fairness, Explainability, Model Monitoring ● Human Augmented AI ● How much MLOps do you need in your organization ● The future
  • 4. What is MLOps? ● https://en.wikipedia.org/wiki/MLOps ● https://ml-ops.org/ ● A process of deploying ML models in CI/CD manner into production, establishing model monitoring, explainability, fairness, and providing tools for human intervention
  • 6. Why a Framework ● A need for the end-to-end solution, from the data ingestion to the model monitoring, data labeling, algorithms explainability
  • 7. An elegant weapon for a more civilized age (c) ● Your father’s ML Pipeline
  • 8. ML has Technical Debt? ● Hidden Debt in Machine Learning Systems
  • 9. The House of MLOps
  • 10. Human-Centered AI ● https://hai.stanford.edu/ ● https://plato.stanford.edu/entries/ethics-ai/ ● https://ethical.institute/ ● Humans must control AI end-to-end solutions
  • 12. Modern MLOps Framework Drivers ● Not only CI/CD and ML code anymore ● Fairness and Explainability ● Observability (Monitoring) ● Scalability (Training and Inference) ● Data Labeling ● A/B Testing, Acceptance Testing ● Human Review ● Legacy Migration ● Multi-tenant Multi-model
  • 14. Explainability ● https://github.com/Trusted-AI/AIF360 ● No bias in data, no bias in inference (gender, racial, religious, ageism etc.) ● Fairness and Explainability by Design as a Process
  • 15. Monitor Data Quality ● Monitors ML models in production and notifies when data quality issue arise ● Enable data capture (inference input & output, historical data) ● Create a baseline (https://github.com/awslabs/deequ) ● Define and schedule data quality monitoring jobs ● View data quality metrics/violations ● Integrate data quality monitoring with a Notification Service ● Interpret the results of a monitoring job ● Visualize results
  • 16. Data Quality Violations/Metrics ● data_type_check ● completeness_check ● baseline_drift_check ● missing_column_check ● extra_column_check ● categorical_values_check ● Max, Min, Sum, SampleCount, Average, Distribution, StdDev, Mean ● ...
  • 17. Monitor Model Quality ● Monitors the performance of a model by comparing the live predictions with the actual ground truth labels ● Enable Data Capture ● Create a baseline ● Define and schedule model quality monitoring jobs ● Ingest ground truth labels that model monitor merges with captured prediction data from real-time/batch inference endpoints ● Integrate model quality monitoring with a Notification Service ● Interpret the results of a monitoring job ● Visualize the results
  • 18. Model Quality Metrics ● Regression: mae, mse, rmse, r2, ... ● Binary classification: confusion matrix, recall, precision, accuracy, recall_best_constant_classifier, precision_best_constant_classifier, accuracy_best_constant_classifier, true_positive_rate, … ● Multiclass classification: weighted_recall, weighted_f1, weighted_f2_best_constant_classifier, ...
  • 19. Monitor Bias Drift ● https://github.com/aws/amazon-sagemaker-clarify ● https://github.com/anodot/MLWatcher ● Training data differs from the live inference data ● Pre-training/post-training/common
  • 20. Bias Metrics ● Class Imbalance (CI) ● Difference in Positive Proportions in Labels (DPL) ● Kullback-Liebler Divergence (KL) ● Jensen-Shannon Divergence (JS) ● Total Variation Distance (TVD) ● Kolmogorov-Smirnov Distance (KS) ● Conditional Demographic Disparity in Labels (CDDL) ● Difference in Conditional Outcomes (DCO) ● Difference in Label Rates (DLR) ● ...
  • 21. Monitor Feature Attribution Drift ● https://github.com/slundberg/shap ● A drift in the distribution of live data for models in production can result in a corresponding drift in the feature attribution values
  • 22. Feature Attribution Drift Monitoring Methods ● LIME ● Shapley sampling values ● DeepLIFT ● QII ● Layer-wise relevance propagation ● Shapley regression values ● Tree interpreter
  • 23. Human Augmented AI Drivers ● Need human oversight to ensure accuracy with sensitive data (healthcare, finance) ● Implement human review of ML predictions ● Integrate human oversight with any application ● Flexibility to work with inside and outside reviewers ● Easy instructions for reviewers ● Workflows to simplify the human review process ● Improve results with multiple reviews
  • 25. Human Augmented Ground Truth Labeling Drivers ● Improve data label accuracy ● Easy to use (automatic snapping, image denoising, pre-selecting object contour, etc.) ● Reduce costs ● Distribute workload over varying workforce
  • 26. Human Augmented Ground Truth Labeling
  • 27. MLOps Levels ● Lightweight MLOps ● Cloud-Native Greenfield SMB MLOps ● Enterprise MLOps ● Human-Centered AI
  • 28. Lightweight MLOps Capacity ● One-person data science shop ● Small number of models (1-3) ● ML system is a greenfield ● Need to run time-critical demo for a small audience ● Models are custom, lightweight, don’t require compute-intensive model training/HPO ● Low traffic is expected
  • 29. Lightweight MLOps Solution Blueprint ● Convert models with TensorFlow Lite (or other framework-specific strip-down) ● Write a simple API microservice (e.g., Flask) ● Deploy as is, with no containerization, as a web app calling ML layer ● Use CPU-based commodity cloud instances ● Minimal model monitoring to at least capture drift ● Data analysis, feature engineering, orchestration, CI/CD, acceptance/AB testing might be omitted ● Bootstrapping is highly needed to organize the process (e.g. Metaflow)
  • 30. Cloud-Native SMB MLOps Capacity ● Have engineering resource ● Custom proprietary algorithms ● ML system is a greenfield ● Model development requires advance comput for training/HPO and inference ● Multi-model, multi tenant setup is needed
  • 31. Cloud-Native SMB MLOps Solution Blueprint ● Containerize models (Docker) ● Utilize framework/cloud vendor specific HPO approaches ● Use GPU-based commodity cloud instances when needed ● Use cloud vendor specific elastic inference approaches ● Abstract and isolate data analysis, feature engineering, model training and other steps ● Orchestrate with Apache Airflow or similar technology agnostic tools ● Ensure multi-tenancy by logical isolation of ML Workflows ● Implement model monitoring at least partially (bias drift, model quality)
  • 32. Enterprise MLOps Capacity ● Have legacy ML system with a lot of microservices, models, orchestration flows ● Have highly custom proprietary libraries requiring complex make ● Have advanced tenant isolation requirements ● Have a lot of models (>10) ● Have advanced needs for a large data science team to collaborate
  • 33. Enterprise MLOps Blueprint ● Serve dockerized models with Kubeflow in a Kubernetes cluster ● Use Kubeflow tenancy isolation ● Use KFServing to deploy multiple variants of multiple models ● Use Katib for HPO ● Use Prometheus + Grafana, ELK for the full model monitoring, consuming metrics with the open-source empowered microservices (SHAP, etc.) ● Implement advanced production acceptance testing (e.g., Differential Testing, Shadow Deployments, Integration Testing etc.) ● Built custom human augmented Review/Labeling tools
  • 34. Human-Centered AI Blueprint ● Can be added at any size/project configuration ● Ideally should be incorporated as a process touch all steps (data analysis, training, deployment, monitoring) ● Remember: the moment your model is deployed to production it’s already obsolete. Build with the CI/CD and human operations review in mind
  • 35. The future ● Privacy-Preserving Machine Learning (differential, compressive, etc.) ● Models interpretability (global, local, saliency mapping, semantic similarity etc.) ● Model Monitoring in AutoML (AutoKeras/Keras Tuner + SHAP, etc.) ● Measuring human augmentation (uncertainty/diversity sampling, active learning, quality control, annotation/augmentation quality metrics, etc.)
  • 37. The End You could reach me via mail@artemkoval.com or LinkedIn May the MLOps be with you!
  • 38. Extra Resources ● https://github.com/EthicalML/awesome-production-machine-learning ● https://aws.amazon.com/solutions/implementations/aws-mlops-framework/ ● https://cloud.google.com/architecture/mlops-continuous-delivery-and-automati on-pipelines-in-machine-learning ● https://azure.microsoft.com/en-us/services/machine-learning/mlops/ ● https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html ● https://github.com/visenger/awesome-mlops#mlops-books