Monitoring AI with AI
Stepan Pushkarev
CTO of Hydrosphere.io
Mission: Accelerate Machine Learning to Production
Opensource Products:
- ML Lambda: ML Deployment and Serving
- Sonar: Data and ML Monitoring
- Mist: Serverless proxy for Spark
Business Model: PaaS and hands-on consulting
About
Traditional Software | Machine Learning applications
Explicit business rules | ML-generated model
Unit testing | Model evaluation
(Micro)service | Model as a Service
Docker per service | Docker per model
1 version of a microservice in prod | 1-10-20 model versions in prod at a time
Eng + QA team owning a service | 1 ML Engineer owning 10-20 models
Fail loudly (exception, stack trace) | Fail silently
Can work forever if verified | Performance declines over time; needs continuous retraining / redeployment
App metrics monitoring | Data monitoring, model metrics monitoring
Cost of an AI/ML Error
● Fun
© http://blog.ycombinator.com/how-adversarial-attacks-work/
● Not fun
● Not fun at all...
● Money
● Business
Where/why may AI fail in prod?
Everywhere!
● Bad training data
● Bad serving data
● Training/serving data skew
● Misconfiguration
● Deployment issue
● Retraining issue
● Performance
● Concept Drift
AI Reliability Pyramid
Reliable Training-Serving pipelines
Comfort Zone for Data Scientist in the
middle of Production
AI Reliability Pyramid
Model Deployment and integration
model.pkl model.zip
How to integrate it into AI Application?
Model server = Model Artifact + Metadata + Runtime + Deps + Sidecar
- Metadata: /predict signature
  input: string text; bytes image;
  output: string summary;
- Runtime: JVM DL4j, GPU
- Model Artifact: matching_model v2 [ .... ]
- gRPC HTTP server
- Sidecar (serving requests): routing, shadowing, pipelining, tracing, metrics, autoscaling, A/B, canary
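A minimal sketch of how the /predict metadata could be used to reject malformed requests before they reach the model. The field names text, image and summary come from the slide; the CONTRACT dictionary, the validate function and the error messages are hypothetical and are not part of any Hydrosphere API.

```python
# Hypothetical contract mirroring the /predict signature shown on the slide.
CONTRACT = {
    "input": {"text": str, "image": bytes},
    "output": {"summary": str},
}


def validate(payload, spec):
    """Return a list of contract violations; an empty list means the payload conforms."""
    errors = []
    # Every field declared in the spec must be present with the declared type.
    for name, expected in spec.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            got = type(payload[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {got}")
    # Fields the spec does not know about are reported as well.
    for name in payload:
        if name not in spec:
            errors.append(f"unexpected field: {name}")
    return errors
```

A sidecar could run the same check on both the request (CONTRACT["input"]) and the model response (CONTRACT["output"]) and count violations as a monitoring metric.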
Model Deployment takeaways
● Eliminates hand-off between Data Scientist -> ML Eng ->
Data Eng -> SA Eng -> QA -> Ops
● Sticks components together: Data + Model + Applications +
Automation = AI Application
● Enables quick transition from research to production. ML
engineers can deploy models many times a day
But wait… This is not safe!
How to ensure we’ll not break things in prod?
AI Reliability Pyramid
1) Is the model degraded?
2) What is the reason?
Data Format Drift
Concept Drift
Data exploration in production
Research:
Data Scientist explores datasets and makes assumptions/hypotheses based on the results of data exploration
Push to Prod
Production:
The model works if and only if the format and statistical properties of prod data are the same as in research
Continuous data exploration and validation?
Automatic Data Profiling
● Avro/Protobuf schema can catch data format drifts
● Statistical properties of input features are to be captured and continuously validated
{"name": "User",
"fields": [
{"name": "name", "type": "string", "min_length": 2, "max_length": 128},
{"name": "age", "type": ["int", "null"], "range": "[10, 100]"},
{"name": "sex", "type": ["string", "null"], "enum": "[male, female, ...]"},
{"name": "wage", "type": ["int", "null"], "validator": "a-distance"}
]
}
Quality metrics generated from
data profile checks
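A minimal sketch of how a profile like the one above could be turned into a quality metric. The in-memory PROFILE layout, the nullable flag and the function names are assumptions made for illustration. The slide elides the full list of enum values, so only the two it names appear here, and the "a-distance" validator is not implemented.

```python
# Hypothetical in-memory form of the profile; the slide stores range and enum as strings.
PROFILE = {
    "name": {"type": str, "nullable": False, "min_length": 2, "max_length": 128},
    "age": {"type": int, "nullable": True, "range": (10, 100)},
    # The slide elides the remaining enum values ("..."); this list is incomplete.
    "sex": {"type": str, "nullable": True, "enum": ["male", "female"]},
    # The "a-distance" validator for wage is out of scope for this sketch.
    "wage": {"type": int, "nullable": True},
}


def check_record(record, profile=PROFILE):
    """Return the names of the fields that violate the profile."""
    failed = []
    for field, rule in profile.items():
        value = record.get(field)
        if value is None:
            # A missing or null value is only a violation for non-nullable fields.
            if not rule["nullable"]:
                failed.append(field)
            continue
        ok = isinstance(value, rule["type"])
        if ok and "min_length" in rule:
            ok = rule["min_length"] <= len(value) <= rule["max_length"]
        if ok and "range" in rule:
            low, high = rule["range"]
            ok = low <= value <= high
        if ok and "enum" in rule:
            ok = value in rule["enum"]
        if not ok:
            failed.append(field)
    return failed


def failure_rate(records, profile=PROFILE):
    """Fraction of records with at least one failed check."""
    if not records:
        return 0.0
    return sum(1 for r in records if check_record(r, profile)) / len(records)
```

Tracking failure_rate per batch of production requests gives one number per feature set that can be plotted and alerted on.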
How to deal with
- multidimensional dataset
- data timeliness
- data completeness
- image data
- complicated seasonality?
Anomaly detection
● Rule based programs -> statistical models -> machine
learning models
● Deal with multidimensional datasets, timeliness and
complicated seasonality
Model Monitoring Metrics on streaming data
● System metrics (latency/throughput)
● Kolmogorov-Smirnov
● Q-Q plot, t-digest
● Spearman and Pearson correlations
● Density based clustering algorithms with Elbow or
Silhouette methods
● Deep Autoencoders
● Generative Adversarial Networks
● Random Cut Forest (AWS paper)
● “Bring your own” metric
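As an illustration of one metric from the list above, the two-sample Kolmogorov-Smirnov statistic can be sketched in a few lines: it is the largest gap between the empirical CDFs of a reference (training) sample and a production sample. The function name and the choice of samples are assumptions; the slides do not show how the metric is computed in the product.

```python
from bisect import bisect_right


def ks_statistic(reference, production):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the empirical CDFs."""
    ref = sorted(reference)
    prod = sorted(production)
    if not ref or not prod:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        gap = max(gap, abs(cdf_ref - cdf_prod))
    return gap
```

A value near 0 means the two samples look alike and a value near 1 means they barely overlap. Which value should trigger an alert is a per-feature decision; a library routine such as scipy.stats.ks_2samp additionally returns a p-value.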
GANs for monitoring data quality at serving time
{production input}
{good}
{drift (fake)}
Model server = Metadata + Model Artifact + Runtime + Deps + Sidecar + Training Metadata
- Metadata: /predict input and output
- Runtime: JVM DL4j / TF / Other, on GPU or CPU
- Model Artifact: model v2 [ .... ]
- gRPC HTTP server
- Sidecar: serving requests
- Training Metadata: training data stats (min, max; range; clusters; quantiles; autoencoder), compared with prod data at runtime
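A minimal sketch of the simplest of these training data stats, captured once at training time and compared with production data at runtime. The function names and the dictionary layout are hypothetical; clusters and the autoencoder are not covered here.

```python
import statistics


def training_stats(values):
    """Summarise one numeric training feature: min, max and quartile cut points."""
    return {
        "min": min(values),
        "max": max(values),
        # Three cut points that split the training values into four groups.
        "quartiles": statistics.quantiles(values, n=4),
    }


def out_of_range_share(stats, batch):
    """Share of production values that fall outside the training [min, max] range."""
    if not batch:
        return 0.0
    outside = sum(1 for v in batch if v < stats["min"] or v > stats["max"])
    return outside / len(batch)
```

The stats dictionary would be stored next to the model artifact, so the check needs no access to the training set when requests are served.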
Change of the Paradigm
Shifts experimentation to
prod/shadowed environment
Use Case: Kolmogorov-Smirnov in action
Use Case: Monitoring NLU system
Figure from: Bapna, Ankur, et al. "Towards zero-shot frame semantic parsing for domain scaling."
arXiv preprint arXiv:1707.02363 (2017).
Use Case: Monitoring NLU system
Source image: Kurata, Gakuto, et al. "Leveraging sentence-level information with encoder lstm for semantic slot filling." arXiv preprint
arXiv:1601.01530 (2016).
● Train and test offline on restaurants domain
● Deploy to prod
● Feed the model with new random Wiki data
● Monitor intermediate input representations (neural network hidden states)
Use Case: Monitoring NLU system
● Red and Purple - cluster
of “Bad” production data
● Yellow and Blue - dev and
test data
AI Reliability Pyramid
Drift Handling
● Unexpected or dramatic drift? - Alert and add
ML/Data Engineer into the loop.
● Expected drift? - Retrain.
Open question to be solved with ML: classify expected
vs. unexpected drift.
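The two rules above can be sketched as a small decision function. Everything here is an assumption made for illustration: the drift score in [0, 1], the two thresholds, and above all the expected flag, which the caller has to supply because classifying expected versus unexpected drift is the open question.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RETRAIN = "retrain"
    ALERT = "alert"


def handle_drift(score, expected, drift_threshold=0.2, dramatic_threshold=0.6):
    """Map a drift score in [0, 1] to an action; both thresholds are hypothetical."""
    if score < drift_threshold:
        # Below the threshold nothing counts as drift.
        return Action.NONE
    if not expected or score >= dramatic_threshold:
        # Unexpected or dramatic drift: bring an ML/Data Engineer into the loop.
        return Action.ALERT
    # Expected, moderate drift: retrain.
    return Action.RETRAIN
```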
Model Retraining - common questions
- When to retrain?
- When/how to push to prod safely?
- What data to retrain with?
Manually on demand
- Works well for 1 model
- But does not scale
Automatically with the latest batch
- Not safe
- Can be expensive
- The latest batch may not be representative
Solution: Reactive AI powered retraining
Thank you
- Stepan Pushkarev
- @hydrospheredata
- https://github.com/Hydrospheredata
- https://hydrosphere.io/
- spushkarev@hydrosphere.io

Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 

Monitoring AI with AI

  • 1. Monitoring AI with AI Stepan Pushkarev CTO of Hydrosphere.io
  • 2. About Mission: Accelerate Machine Learning to Production. Open-source products: ML Lambda (ML deployment and serving); Sonar (data and ML monitoring); Mist (serverless proxy for Spark). Business model: PaaS and hands-on consulting.
  • 3. Traditional software vs. machine learning applications:
      Explicit business rules vs. ML-generated model
      Unit testing vs. model evaluation
      (Micro)service vs. model as a service
      Docker per service vs. Docker per model
      1 version of a microservice in prod vs. 1-10-20 model versions in prod at a time
      Eng + QA team owning a service vs. 1 ML engineer owning 10-20 models
      Fail loudly (exception, stack trace) vs. fail silently
      Can work forever if verified vs. performance declines over time, needs continuous retraining / redeployment
      App metrics monitoring vs. data monitoring and model metrics monitoring
  • 4. Cost of an AI/ML Error ● Fun © http://blog.ycombinator.com/how-adversarial-attacks-work/
  • 5. Cost of an AI Error ● Fun ● Not fun
  • 6. Cost of an AI Error ● Fun ● Not fun ● Not fun at all…
  • 7. Cost of an AI Error ● Fun ● Not fun ● Not fun at all… ● Money
  • 8. Cost of an AI Error ● Fun ● Not fun ● Not fun at all… ● Money ● Business
  • 9. Where/why may AI fail in prod?
  • 10. Where/why may AI fail in prod? Everywhere!
  • 11. Where/why may AI fail in prod? Everywhere! ● Bad training data ● Bad serving data ● Training/serving data skew ● Misconfiguration ● Deployment issue ● Retraining issue ● Performance ● Concept drift
  • 13. Reliable Training-Serving pipelines Comfort Zone for Data Scientist in the middle of Production
  • 15. Model Deployment and integration model.pkl model.zip How to integrate it into AI Application?
  • 16. Model server = Model Artifact + Metadata + Runtime + Deps + Sidecar [Diagram: /predict contract with input (string text; bytes image) and output (string summary); runtime JVM, DL4j, GPU; model matching_model v2; gRPC HTTP server; sidecar in front of the serving requests handling routing, shadowing, pipelining, tracing, metrics, autoscaling, A/B and canary]
  • 17. Model Deployment takeaways ● Eliminates the hand-off between Data Scientist -> ML Eng -> Data Eng -> SA Eng -> QA -> Ops ● Ties components together: Data + Model + Applications + Automation = AI Application ● Enables a quick transition from research to production. ML engineers can deploy models many times a day. But wait… This is not safe! How do we ensure we won't break things in prod?
  • 18. AI Reliability Pyramid 1) Is the model degraded? 2) What is the reason?
  • 22. Data exploration in production Research: Data Scientist makes assumptions based on results of data exploration
  • 23. Data exploration in production Research: Data Scientist explores datasets and makes assumptions/hypotheses Production: The model works if and only if the format and statistical properties of prod data are the same as in research Push to Prod
  • 24. Data exploration in production Research: Data Scientist makes assumptions based on results of data exploration Production: The model works if and only if format and statistical properties of prod data are the same as in research Push to Prod Continuous data exploration and validation?
  • 25. Automatic Data Profiling ● Avro/Protobuf schema can catch data format drifts ● Statistical properties of input features are to be captured and continuously validated {"name": "User", "fields": [ {"name": "name", "type": "string", "min_length": 2, "max_length": 128}, {"name": "age", "type": ["int", "null"], "range": "[10, 100]"}, {"name": "sex", "type": ["string", "null"], "enum": "[male, female, ...]"}, {"name": "wage", "type": ["int", "null"], "validator": "a-distance"} ] }
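The per-field constraints in the annotated schema above can be checked record by record at serving time. What follows is a minimal sketch, not Sonar's implementation: the `PROFILE` layout and the `validate` helper are hypothetical, the enum is limited to the two values the slide actually lists, and the `wage` field with its "a-distance" validator is left out because the slide does not define it.

```python
from typing import Any

# Hypothetical in-memory form of the slide's annotated schema.
# The constraints come from the slide; the dictionary layout is an assumption.
PROFILE = {
    "name": {"type": str, "nullable": False, "min_length": 2, "max_length": 128},
    "age": {"type": int, "nullable": True, "range": (10, 100)},
    # The slide elides further enum values; only the listed ones are used here.
    "sex": {"type": str, "nullable": True, "enum": {"male", "female"}},
}


def validate(record: dict[str, Any], profile: dict[str, dict] = PROFILE) -> list[str]:
    """Return human-readable violations; an empty list means the record passes."""
    errors = []
    for field, rule in profile.items():
        value = record.get(field)
        if value is None:
            if not rule["nullable"]:
                errors.append(f"{field}: missing")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min_length" in rule and len(value) < rule["min_length"]:
            errors.append(f"{field}: shorter than {rule['min_length']}")
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{field}: longer than {rule['max_length']}")
        if "range" in rule:
            low, high = rule["range"]
            if not low <= value <= high:
                errors.append(f"{field}: outside [{low}, {high}]")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{field}: not an allowed value")
    return errors
```

Counting violations per field over a time window would give one possible source for the quality metrics mentioned on the next slide.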
  • 26. Quality metrics generated from data profile checks
  • 27. How to deal with - multidimensional dataset - data timeliness - data completeness - image data - complicated seasonality?
  • 29. Anomaly detection ● Rule based programs -> statistical models -> machine learning models ● Deal with multidimensional datasets, timeliness and complicated seasonality
  • 30. Model Monitoring Metrics on streaming data ● System metrics (latency/throughput) ● Kolmogorov-Smirnov ● Q-Q plot, t-digest ● Spearman and Pearson correlations ● Density based clustering algorithms with Elbow or Silhouette methods ● Deep Autoencoders ● Generative Adversarial Networks ● Random Cut Forest (AWS paper) ● “Bring your own” metric
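Of the metrics listed above, the Kolmogorov-Smirnov test is the simplest to illustrate: it compares the empirical distribution of a feature in a reference (training) sample with that of a production window. The sketch below is a plain-Python illustration under stated assumptions, not the talk's code; the function names are hypothetical and the decision rule uses the standard large-sample approximation for the critical value.

```python
import math
from bisect import bisect_right


def ks_statistic(reference: list[float], production: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    prod = sorted(production)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        d = max(d, abs(cdf_ref - cdf_prod))
    return d


def drifted(reference: list[float], production: list[float], alpha: float = 0.05) -> bool:
    """Flag drift when the statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(production)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, production) > critical
```

On a real stream the production sample would be a sliding window, and a sketch structure such as the t-digest named on the slide could stand in for the sorted lists.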
  • 31. GANs for monitoring data quality at serving time {production input} {good} {drift (fake)}
  • 32. Model server = Metadata + Model Artifact + Runtime + Deps + Sidecar + Training Metadata [Diagram: /predict contract with input and output; runtime JVM, DL4j / TF / other, GPU or CPU; model v2; gRPC HTTP server; sidecar in front of the serving requests] Training data stats (min, max; range; clusters; quantiles; autoencoder) are compared with prod data at runtime
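The idea on this slide, shipping training-data statistics with the model and comparing them with production data at runtime, can be sketched for a single numeric feature. The `FeatureStats` class and both helpers are hypothetical names chosen for illustration; only min, max and quartiles are covered, while clusters and the autoencoder are out of scope for this sketch.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class FeatureStats:
    """Statistics captured once at training time and stored next to the model."""
    minimum: float
    maximum: float
    quartiles: tuple[float, float, float]


def profile(values: list[float]) -> FeatureStats:
    """Summarize a training feature; needs at least two values."""
    q1, q2, q3 = quantiles(values, n=4)
    return FeatureStats(min(values), max(values), (q1, q2, q3))


def out_of_range_share(stats: FeatureStats, batch: list[float]) -> float:
    """Fraction of a production batch that falls outside the training range."""
    if not batch:
        return 0.0
    outside = sum(1 for v in batch if v < stats.minimum or v > stats.maximum)
    return outside / len(batch)
```

A sidecar could evaluate `out_of_range_share` on each window of requests and export the result as a metric, leaving the model runtime itself untouched.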
  • 33. Change of the Paradigm Shifts experimentation to prod/shadowed environment
  • 35. Use Case: Monitoring NLU system Figure from: Bapna, Ankur, et al. "Towards zero-shot frame semantic parsing for domain scaling." arXiv preprint arXiv:1707.02363 (2017).
  • 36. Use Case: Monitoring NLU system Source image: Kurata, Gakuto, et al. "Leveraging sentence-level information with encoder lstm for semantic slot filling." arXiv preprint arXiv:1601.01530 (2016). ● Train and test offline on restaurants domain ● Deploy to prod ● Feed the model with new random Wiki data ● Monitor intermediate input representations (neural network hidden states)
  • 37. Use Case: Monitoring NLU system ● Red and Purple - cluster of “Bad” production data ● Yellow and Blue - dev and test data
  • 39. Drift Handling ● Unexpected or dramatic drift? - Alert and add ML/Data Engineer into the loop. ● Expected drift? - Retrain. Open question to be solved with ML: classify expected vs. unexpected drift.
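The routing rule on this slide can be written down as a small decision function. This is only a sketch: the thresholds are made-up placeholders, the drift score is assumed to come from some upstream metric, and the `expected` flag is taken as an input because, as the slide notes, telling expected from unexpected drift is still an open question.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RETRAIN = "retrain"
    ALERT = "alert"


def handle_drift(
    score: float,
    expected: bool,
    drift_threshold: float = 0.2,     # hypothetical value, tune per model
    dramatic_threshold: float = 0.6,  # hypothetical value, tune per model
) -> Action:
    """Map a drift score to the action suggested on the slide."""
    if score < drift_threshold:
        return Action.NONE
    # Unexpected or dramatic drift: bring an ML/Data Engineer into the loop.
    if score >= dramatic_threshold or not expected:
        return Action.ALERT
    # Moderate drift that was expected: retrain.
    return Action.RETRAIN
```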
  • 40. Model Retraining - common questions When to retrain? When/how to push to prod? What data to retrain with? Manually on demand: works well for 1 model, but does not scale
  • 41. Model Retraining - common questions When to retrain? When/how to push to prod safely? What data to retrain with? Manually on demand: works well for 1 model, but does not scale. Automatically with the latest batch: not safe, can be expensive, the latest batch may not be representative
  • 42. Solution: Reactive AI powered retraining
  • 43. Thank you - Stepan Pushkarev - @hydrospheredata - https://github.com/Hydrospheredata - https://hydrosphere.io/ - spushkarev@hydrosphere.io