Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Practices
Squadex Consulting Services
Stepan Pushkarev
Platform Engineer
Rinat Gareev
ML Engineer
Iskandar Sitdikov
ML Engineer
Overview
● Data Pipelines
● Modeling & Training
● Deployment
Data Ingestion
● Streaming - as much as possible even if you are not realtime
○ 99% of world's data has a timestamp
○ Fast Data is better than Big Data
○ Rich API for time series data (late arrivals, replays, etc)
○ Streams - advanced durable file system
○ Monitoring and operations friendly
○ Windowing aggregations and stateful transformations
○ Enrichments right in the stream
○ Realtime!
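The windowing and late-arrival points above can be illustrated with a small sketch. It is not tied to any particular streaming engine; the `(event_time, value)` event shape, the window size, and the `allowed_lateness` parameter are assumptions made for illustration only.

```python
from collections import defaultdict

def tumbling_window_sums(events, window_size, allowed_lateness):
    """Sum values per fixed-size event-time window.

    events: iterable of (event_time, value) in arrival order.
    An event is dropped as "too late" when its window closed more than
    allowed_lateness before the largest event time seen so far (the watermark).
    """
    sums = defaultdict(float)
    dropped = []
    watermark = float("-inf")
    for event_time, value in events:
        watermark = max(watermark, event_time)
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end + allowed_lateness <= watermark:
            dropped.append((event_time, value))  # arrived after the grace period
            continue
        sums[window_start] += value
    return dict(sums), dropped
```

Replaying the same events in the same order yields the same result, which is one reason a durable stream is convenient for reprocessing.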
Batch
AirFlow - DAG runner
Pros:
- Nice UI
- Easy to start, plugins ecosystem
Cons:
- Celery
- Challenging for Operations
Tip:
- Always use Docker or Kube Operators
AWS Step Functions
Pros:
- Managed by AWS
- Event driven
Cons:
- Less sophisticated flows
- JSON programming
- Managed by AWS
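"JSON programming" refers to writing flows in Amazon States Language. One way to soften it is to generate the definition from code. The sketch below builds a two-step flow as a Python dict; the state names and the Lambda ARNs are placeholders, not part of the original talk.

```python
import json

def batch_state_machine(extract_arn, load_arn):
    """Build a two-step Amazon States Language definition as a dict."""
    return {
        "Comment": "Hypothetical extract-then-load batch flow",
        "StartAt": "Extract",
        "States": {
            "Extract": {
                "Type": "Task",
                "Resource": extract_arn,
                # Retry the task twice with exponential backoff before failing
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "Next": "Load",
            },
            "Load": {"Type": "Task", "Resource": load_arn, "End": True},
        },
    }

definition = json.dumps(
    batch_state_machine(
        "arn:aws:lambda:us-east-1:123456789012:function:extract",  # placeholder
        "arn:aws:lambda:us-east-1:123456789012:function:load",     # placeholder
    ),
    indent=2,
)
```

The resulting `definition` string is what would be supplied when creating the state machine.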
Batch - System Layer
- Versioning
- Contracts with input / output schemas and transformation rules (DSL)
- HTTP, S3, Postgres etc. clients to easily retrieve / persist / upload data
- Unified naming and folder structures based on service / resource / timestamp / extension
- Transactions, state persistence
- Cleanup
- Metrics, Logging & Tracing
- Trick - allows switching to streaming any time!
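Two items of this system layer, unified naming and input / output contracts, can be sketched in a few lines. The key layout (`service/resource/YYYY/MM/DD/HHMMSS.extension`) and the schema format (field name mapped to a Python type) are assumptions chosen for illustration; the talk does not prescribe either.

```python
from datetime import datetime, timezone

def object_key(service, resource, ts, extension):
    """Build a key of the form service/resource/YYYY/MM/DD/HHMMSS.extension."""
    ts = ts.astimezone(timezone.utc)  # normalise so keys sort consistently
    return f"{service}/{resource}/{ts:%Y/%m/%d/%H%M%S}.{extension}"

def validate(record, schema):
    """Return a list of contract violations; empty means the record conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors
```

Because keys and contracts are computed per record rather than per batch run, the same helpers keep working if the job is later moved to a stream.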
Glue Data Catalog - Must have
Glue ETL Jobs - Your ETL “System Layer” for Spark
Overview
● ML Process
● Data Pipelines
● Modeling & Training
● Deployment
Machine Learning on AWS
Amazon Machine Learning
Amazon SageMaker
AWS Deep Learning AMIs
* application services (Comprehend, Lex, Polly, Transcribe, Translate, Rekognition) are out of scope
Machine Learning on AWS – AML
Provide data, no code* required
* optionally – JSON
[Diagram: Amazon Machine Learning, Amazon SageMaker, and AWS Deep Learning AMIs compared by programming effort and flexibility]
Amazon Machine Learning – Overview
Good for lazy & quick baselines
Inputs: CSV on S3, SQL over Redshift or RDS MySQL
Includes:
- Data visualization
- Data transformation utils (with inferred default ‘recipes’), Feature Selection
- Feature crosses
- Prediction targets: numeric – linear regression; categorical – (multinomial) logistic regression
- Evaluation
Amazon Machine Learning – Limits
No hyperparameter tuning (Cf. SageMaker Linear Learner)
No built-in cross-validation (requires additional scripting)
System limits: 100 KB per observation, max 10000 output features, max 100 classes, etc
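The "additional scripting" for cross-validation mostly comes down to producing the fold splits yourself. A minimal sketch of that step, assuming rows are addressed by index; writing each fold to S3 and creating one model and evaluation per fold is left out.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps folds reproducible
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

Each row lands in exactly one validation fold, so averaging the per-fold evaluation scores gives the cross-validated estimate.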
Machine Learning on AWS – SageMaker
A platform that consists of:
● model implementations
● model tuner
● notebook hosting
● model hosting
[Diagram: Amazon Machine Learning, Amazon SageMaker, and AWS Deep Learning AMIs compared by programming effort and flexibility]
SageMaker – Algorithms – Overview
Classification/Regression
● Linear Learner
● XGBoost
● Factorization Machines
● K-Nearest Neighbors
Image processing
● Image Classification
● Object Detection
Text processing
● BlazingText
● Topic Modeling – LDA
● Topic Modeling – NTM
Clustering
● K-Means
Time Series
● DeepAR Forecasting
Dimensionality Reduction
● Principal Component Analysis
Encoder-Decoder
● Seq2Seq
Anomaly Detection
● Random Cut Forest
Bring Your Own
SageMaker – BlazingText
Very fast & convenient FastText implementation
Use case:
1. Given a few labelled texts for an ML problem and many more unlabelled (domain-specific) texts
2. Train word embeddings on unlabelled data
3. Use these embeddings to initialize input layer of your neural network
Also: BlazingText provides supervised mode for text classification
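Step 3 of the use case can be sketched without committing to a framework. The sketch assumes the trained vectors are available in the word2vec text format (a `count dim` header line followed by one `word v1 ... vd` line per word); the function names and the random initialisation range for unknown words are illustrative choices.

```python
import random

def load_vectors(lines):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vd'."""
    lines = iter(lines)
    _, dim = map(int, next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:dim + 1]]
    return vectors, dim

def embedding_matrix(vocab, vectors, dim, seed=0):
    """One row per vocabulary word; unknown words get small random values."""
    rng = random.Random(seed)
    return [
        vectors[word] if word in vectors
        else [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        for word in vocab
    ]
```

The resulting matrix would be passed as the initial weights of the network's embedding layer, in the same row order as the model's vocabulary.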
SageMaker – Image Classification
Implementation of ‘Deep Residual Learning for Image Recognition’ aka ResNet.
Two modes:
● Full training – starts from scratch
● Transfer learning – starts from a model trained on ImageNet
○ different options available – 18-200 layers.
Other handy features: image resizing, augmentation.
Not good: nasty checkpointing, no early stopping.
TensorFlow on SageMaker
You provide a python* script with functions:
● model_fn
● train_input_fn
● eval_input_fn
● serving_input_fn
SageMaker does the rest (including TensorBoard in training mode).
* 😟 Lack of Python 3 support (so far)
Similar to TF Estimator API
SageMaker – Custom Algorithms
Bring Your Own Docker Image
SageMaker provides:
● Training: JSON with configurations of data & cluster & hyperparameters
● Training: mounted data files (or pipes)
● Inference: model artifacts
SageMaker expects:
● Training: train entry point
● Training: model artifacts after completion
● Inference: serve entry point
● Inference: responds to /invocations and /ping on port 8080
SageMaker does not provide:
● access to the container, e.g., no TensorBoard without tricks 😕
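The inference side of this contract is small enough to sketch with the standard library alone. The example below answers `GET /ping` and `POST /invocations`; the JSON payload shape and the `predict` function are placeholders for real inference on the loaded model artifacts, and a production container would normally use a proper application server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(payload):
    """Placeholder model: replace with real inference on loaded artifacts."""
    return {"prediction": sum(payload["features"])}

class Handler(BaseHTTPRequestHandler):
    def _send(self, status, body=b""):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check endpoint
        self._send(200 if self.path == "/ping" else 404)

    def do_POST(self):
        if self.path != "/invocations":
            return self._send(404)
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        self._send(200, json.dumps(predict(payload)).encode())

    def log_message(self, *args):
        pass  # keep the example quiet

def make_server(port=8080):
    """Create the HTTP server; the caller decides when to start serving."""
    return HTTPServer(("0.0.0.0", port), Handler)
```

The container's serve entry point would then call `make_server().serve_forever()`.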
SageMaker – Automatic Model Tuning
Implementation of Bayesian optimization for hyperparameter value search
Good:
● easy-to-use, like Sklearn GridSearch
● supports BYO/custom models
● faster experimentation
● no mess with provisioning instances
● no pricing overhead
Not good:
● unlike Sklearn – no cross-validation
● unlike Sklearn – no fitting of an entire pipeline (e.g., including feature extractors and other transformers)
● support of spot instances won’t hurt
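One possible workaround for the missing cross-validation, when using a custom training container, is to run the folds inside the training script and log a single averaged score, since the tuner reads objective metrics for custom algorithms from the training logs via a regular expression. The metric name, log format, and regex below are assumptions for illustration.

```python
import re
from statistics import mean

METRIC_NAME = "cv-accuracy"                 # hypothetical metric name
METRIC_REGEX = r"cv-accuracy: ([0-9\.]+)"   # regex the tuner would be given

def report_cv_metric(fold_scores):
    """Average per-fold scores and format the log line the tuner will scrape."""
    return f"{METRIC_NAME}: {mean(fold_scores):.4f}"

line = report_cv_metric([0.90, 0.92, 0.94])
print(line)
```

The name and regex pair would then be registered with the tuning job as a metric definition (a `Name` / `Regex` entry), so that each training job is scored on its cross-validated result rather than on a single split.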
SageMaker – Algorithms – Pros and Cons
Not good:
● ‘S3-centered’, which complicates fitting feature extraction into the pipeline
Good:
● Unified IO system for training & inference, plus Pipe mode
● Model-agnostic hyperparameter tuning
● Distributed training
● Flexible pricing for training – pay by the second
● All of the above is available for your custom algorithm
Machine Learning on AWS – AWS DL AMIs
[Diagram: Amazon Machine Learning, Amazon SageMaker, and AWS Deep Learning AMIs compared by programming effort and flexibility]
Freedom!
* https://en.wikipedia.org/wiki/File:Raft-slab.jpg (CC BY-SA 3.0)
Avoid headache with EC2 instance setup: installation of Nvidia drivers, python,
your favourite DL framework, compatibility warnings & errors, configuration,
optimization, etc.
Use cases:
● to use data sources other than S3
● to implement more sophisticated data pipelines
● to implement more complex training workflows
● to run long training on Spot Instances
AWS Deep Learning AMIs
SageMaker – Jupyter Notebooks
AWS value-add:
● Better authentication
● Ready-to-go environments
● Easy IAM setup
● Quick start with example notebooks
● Stop & Restore when necessary
○ Lifecycle configurations
Notebook Best Practices / versioning
- Git (CLI, plugins, extensions)
- Post-save hook to save .py and .html files alongside the notebook
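A post-save hook of this kind can be sketched as below, following the hook signature that the classic Jupyter Notebook server passes to `FileContentsManager.post_save_hook`. The helper name `export_commands` is illustrative, and the sketch assumes `jupyter nbconvert` is installed on the notebook instance.

```python
import os
from subprocess import check_call

def export_commands(notebook_name):
    """nbconvert invocations that write .py and .html next to the notebook."""
    return [
        ["jupyter", "nbconvert", "--to", "script", notebook_name],
        ["jupyter", "nbconvert", "--to", "html", notebook_name],
    ]

def post_save(model, os_path, contents_manager):
    """Post-save hook: export the saved notebook to .py and .html."""
    if model["type"] != "notebook":
        return  # ignore plain files and directories
    directory, notebook_name = os.path.split(os_path)
    for command in export_commands(notebook_name):
        check_call(command, cwd=directory)

# In jupyter_notebook_config.py, register it with:
# c.FileContentsManager.post_save_hook = post_save
```

Committing the generated .py file next to the .ipynb gives readable diffs in Git, while the .html file preserves the rendered outputs.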
Notebook Best Practices / reproducing
Steps:
1. Configure variables
2. Install dependencies
3. List dependencies
4. Define constants
5. Load data
6. Load schema
- During instance boot
- Within notebook itself (hacky way)
Notebook Best Practices / share & collaboration
[Diagram 1: browser, proxy, two instances with one kernel each, notebook storage]
[Diagram 2: browser, one multiuser instance with two kernels, notebook storage]
SageMaker, JupyterHub, Colab, Zeppelin
Overview
● ML Process
● Data Pipelines
● Modeling & Training
● Deployment
SageMaker – Deployment
Pros:
- One click
- Single row & batch
- Traffic split, A/B
- Autoscale group
Cons:
- EC2 instance per model
- No Contracts
- No shadow, replay tests
- No model metrics
- No model versioning
Alternative:
- Do it yourself on EKS/ECS
- Opensource: hydrosphere.io
Deployment Best Practice
[Diagram labels: /predict with input and output; runtime JVM DL4j / TF / other on GPU or CPU; model v2; gRPC / HTTP server; sidecar; serving requests]
training data stats:
- min, max
- range
- clusters
- quantiles
- autoencoder
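The simpler statistics in this list (min, max, range, quantiles) can be computed at training time and compared against serving traffic. A minimal sketch for one numeric feature; the function names and the out-of-range check are illustrative assumptions, not the method used in the talk.

```python
from statistics import quantiles

def profile(values):
    """Summarise a numeric training feature: min, max, range, quartiles."""
    lo, hi = min(values), max(values)
    return {"min": lo, "max": hi, "range": hi - lo,
            "quartiles": quantiles(values, n=4)}

def out_of_range_share(serving_values, stats):
    """Fraction of serving values outside the range seen in training."""
    outside = [v for v in serving_values
               if v < stats["min"] or v > stats["max"]]
    return len(outside) / len(serving_values)
```

A rising out-of-range share on production inputs is a cheap early signal of skew between training and serving data.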
Model Maintenance
● Bad training data
● Bad serving data
● Training/serving data skew
● Misconfiguration
● Performance
● Concept Drift
● Adversarial input
● Upstream error
Model Maintenance - Best practices
● Data contracts
● Data Profiling
● Model Performance monitoring
● Concept drift monitoring
● Smart subsampling
● Active retraining and active learning
Online Monitoring / Profiling infrastructure
- Learn the latent space online in a stream of production inputs
- Compare “online” snapshots with offline
- Index encodings for analysis / audit / exploration
Stratified sampling from the distribution learnt
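Stratified sampling itself is straightforward once each production input has been assigned to a stratum, for example a cluster or bin of the learnt distribution. In the sketch below `stratum_of` is a hypothetical function that returns that assignment; how the strata are learnt is outside the sketch.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=0):
    """Sample the same fraction from every stratum (at least one item each)."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample
```

Keeping at least one item per stratum means rare regions of the input space still reach labelling or retraining, which plain random subsampling does not guarantee.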
ML lifecycle orchestration
- Jenkins :)
- AWS CodePipeline
- StepFunctions
- KubeFlow
KubeFlow vs. MLFlow
MLFlow - experiments tracking and reproducibility
KubeFlow = argo + kube configs + ml tutorial
Takeaways
- A well-architected data platform is a key enabler for a successful ML project
- The AWS ML ecosystem is a great foundation for your enterprise-wide ML platform, but you should know your options and the extra layer that has to be built
- ML reproducibility and ML lifecycle management are a wild space, with no off-the-shelf solution
squadex.com
125 University Avenue
Suite 290, Palo Alto
California, 94301
Questions, details?
We would be happy to answer!

By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 

Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Practices

  • 1. Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Practices
  • 3. Stepan Pushkarev Platform Engineer Rinat Gareev ML Engineer Iskandar Sitdikov ML Engineer
  • 5. Overview ● Data Pipelines ● Modeling & Training ● Deployment
  • 6. Data Ingestion ● Streaming - as much as possible even if you are not realtime ○ 99% of the world's data has a timestamp ○ Fast Data is better than Big Data ○ Rich API for time series data (late arrivals, replays, etc.) ○ Streams - advanced durable file system ○ Monitoring and operations friendly ○ Windowing aggregations and stateful transformations ○ Enrichments right in the stream ○ Realtime!
  • 9. Batch. Airflow - DAG runner Pros: - Nice UI - Easy to start, plugins ecosystem Cons: - Celery - Challenging for Operations Tip: - Always use Docker or Kube Operators. AWS Step Functions Pros: - Managed by AWS - Event driven Cons: - Less sophisticated flows - JSON programming - Managed by AWS
  • 10. Batch - System Layer - Versioning - Contracts with input / output schemas and transformation rules (DSL) - HTTP, S3, Postgres etc. clients to easily retrieve / persist / upload data - Unified naming and folder structures based on service / resource / timestamp / extension - Transactions, state persistence - Cleanup - Metrics, Logging & Tracing - Trick - allows switching to streaming any time!
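The unified naming rule above (service / resource / timestamp / extension) can be sketched as a small helper. This is only an illustration: the function name `build_object_key`, the timestamp format and the example values are assumptions, not something the slides specify.

```python
from datetime import datetime, timezone


def build_object_key(service: str, resource: str, ts: datetime, extension: str) -> str:
    """Build a storage key of the form service/resource/timestamp.extension."""
    # Normalize to UTC so keys sort chronologically whatever the caller's timezone.
    # Pass a timezone-aware datetime; naive values are interpreted as local time.
    stamp = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"{service}/{resource}/{stamp}.{extension.lstrip('.')}"


# Hypothetical service and resource names, for illustration only.
key = build_object_key(
    "billing", "invoices", datetime(2018, 9, 1, 12, 30, tzinfo=timezone.utc), "csv"
)
print(key)
```

Keeping key construction in one function means every batch job writes to predictable locations, which also makes cleanup and replays easier to script.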
  • 11. Glue Data Catalog - Must have. Glue ETL Jobs - Your ETL “System Layer” for Spark
  • 12. Overview ● ML Process ● Data Pipelines ● Modeling & Training ● Deployment
  • 13. Machine Learning on AWS Amazon Machine Learning Amazon SageMaker AWS Deep Learning AMIs * application services (Comprehend, Lex, Polly, Transcribe, Translate, Rekognition) are out of scope
  • 14. Machine Learning on AWS – AML. Options, in order of increasing programming effort and flexibility: Amazon Machine Learning, Amazon SageMaker, AWS Deep Learning AMIs. AML: provide data, no code* required (*optionally – JSON)
  • 15. Amazon Machine Learning – Overview. Good for lazy & quick baselines. Inputs: CSV on S3, SQL over Redshift or RDS MySQL. Includes: - Data visualization - Data transformation utils (with inferred default ‘recipes’), Feature Selection - Feature crosses - Prediction targets: numeric – linear regression; categorical – (multinomial) logistic regression - Evaluation
  • 16. Amazon Machine Learning – Limits No hyperparameter tuning (Cf. SageMaker Linear Learner) No built-in cross-validation (requires additional scripting) System limits: 100 KB per observation, max 10000 output features, max 100 classes, etc
  • 17. Machine Learning on AWS – SageMaker. Options, in order of increasing programming effort and flexibility: Amazon Machine Learning, Amazon SageMaker, AWS Deep Learning AMIs. SageMaker is a platform that consists of: ● model implementations ● model tuner ● notebook hosting ● model hosting
  • 18. SageMaker – Algorithms – Overview. Classification/Regression: Linear Learner, XGBoost, Factorization Machines, K-Nearest Neighbors. Image processing: Image Classification, Object Detection. Text processing: BlazingText, Topic Modeling – LDA, Topic Modeling – NTM. Clustering: K-Means. Time Series: DeepAR Forecasting. Dimensionality Reduction: Principal Component Analysis. Encoder-Decoder: Seq2Seq. Anomaly Detection: Random Cut Forest. Bring Your Own.
  • 19. SageMaker – BlazingText Very fast & convenient FastText implementation Use case: 1. Given few labelled texts for a ML problem and much more unlabelled (domain-specific) 2. Train word embeddings on unlabelled data 3. Use these embeddings to initialize input layer of your neural network Also: BlazingText provides supervised mode for text classification
  • 20. SageMaker – Image Classification Implementation of ‘Deep Residual Learning for Image Recognition’ aka ResNet. Two modes: ● Full training – starts from scratch ● Transfer learning – starts from a model trained on ImageNet ○ different options available – 18-200 layers. Other handy features: image resizing, augmentation. Not good: nasty checkpointing, no early stopping.
  • 21. TensorFlow on SageMaker You provide a python* script with functions: ● model_fn ● train_input_fn ● eval_input_fn ● serving_input_fn SageMaker does the rest (including TensorBoard in training mode). * 😟 Lack of Python 3 support (so far) Similar to TF Estimator API
  • 22. SageMaker – Custom Algorithms. Bring Your Own Docker Image, for both training and inference. SageMaker provides: ● JSON with configurations of data & cluster & hyperparameters ● mounted data files (or pipes) ● model artifacts. SageMaker expects, for training: ● train entry point ● model artifacts after completion; for inference: ● serve entry point ● responses to /invocations and /ping on port 8080. SageMaker does not provide: ● access to the container, e.g., no TensorBoard without tricks 😕
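The inference half of that contract (answer `/ping` and `/invocations` on port 8080) can be sketched with the Python standard library alone. This is a minimal sketch, not a production server: `predict` is a hypothetical stand-in for a real model, and the `{"features": [...]}` request shape is an assumption made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a real model: returns the sum of the inputs."""
    return sum(features)


class InferenceHandler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: reply 200 when the container is ready to serve.
        if self.path == "/ping":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/invocations":
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length))["features"]
            self._reply(200, {"prediction": predict(features)})
        except (ValueError, KeyError, TypeError):
            self._reply(400, {"error": "expected JSON with a 'features' list"})

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet; real services should log requests


def make_server(port=8080):
    """Create the server; call .serve_forever() on the result to start it."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)


# In the container's serve entry point one would run:
# make_server().serve_forever()
```

A real image would usually put a production-grade server in front of the model; the point here is only the two routes and the port.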
  • 23. SageMaker – Automatic Model Tuning. Implementation of Bayesian optimization for hyperparameter value search. Good: ● easy to use, like Sklearn GridSearch ● supports BYO/custom models ● faster experimentation ● no mess with provisioning instances ● no pricing overhead. Not good: ● unlike Sklearn – no cross-validation ● unlike Sklearn – no fitting of an entire pipeline (e.g., including feature extractors and other transformers) ● spot instance support would be welcome
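Since the tuner has no built-in cross-validation, one possible workaround (an assumption of this sketch, not something the slides prescribe) is to run k-fold cross-validation inside the training script and report the mean score as the single objective metric. The helpers `k_fold_indices` and `cross_val_score`, and the `fit` / `score` callables they take, are hypothetical names for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def cross_val_score(fit, score, X, y, k=5):
    """Average `score` over k folds; `fit` and `score` are user-supplied callables."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return sum(scores) / len(scores)


# Toy model for illustration: always predict the mean of the training targets.
def fit_mean(X, y):
    return sum(y) / len(y)


def neg_mse(model, X, y):
    return -sum((model - target) ** 2 for target in y) / len(y)


cv = cross_val_score(fit_mean, neg_mse, list(range(10)), [2.0] * 10, k=5)
# Print the metric in a fixed format so a log-based metric parser can pick it up.
print(f"cv_score={cv:.4f};")
```

The cost is k model fits per tuning trial, so the trade-off between fold count and tuning budget is worth checking.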
  • 24. SageMaker – Algorithms – Pros and Cons. Not good: ● ‘S3-centered’, which complicates fitting feature extraction into the pipeline. Good: ● Unified IO system for training & inference, plus Pipe mode ● Model-agnostic hyperparameter tuning ● Distributed training ● Flexible pricing for training – pay by the second. All of the above is available for your custom algorithm.
  • 25. Machine Learning on AWS – AWS DL AMIs. Options, in order of increasing programming effort and flexibility: Amazon Machine Learning, Amazon SageMaker, AWS Deep Learning AMIs. DL AMIs: Freedom! * Image: https://en.wikipedia.org/wiki/File:Raft-slab.jpg (CC BY-SA 3.0)
  • 26. AWS Deep Learning AMIs. Avoid the headache of EC2 instance setup: installation of Nvidia drivers, Python, your favourite DL framework, compatibility warnings & errors, configuration, optimization, etc. Use cases: ● to use data sources other than S3 ● to implement more sophisticated data pipelines ● to implement more complex training workflows ● to run long training on Spot Instances
  • 27. SageMaker – Jupyter Notebooks AWS value-add: ● Better authentication ● Ready-to-go environments ● Easy IAM setup ● Quick start with example notebooks ● Stop & Restore when necessary ○ Lifecycle configurations
  • 28. Notebook Best Practices / versioning - Git (cli, plugins, extensions) - Post-save hook to save .py and .html files alongside the notebook.
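A post-save hook of this kind can be sketched as follows. It relies on Jupyter's `FileContentsManager.post_save_hook` setting and the `jupyter nbconvert` command line; the helper name `export_commands` is an assumption introduced for the example.

```python
import os
import subprocess


def export_commands(os_path):
    """Return the nbconvert commands that export a notebook to .py and .html."""
    filename = os.path.basename(os_path)
    return [["jupyter", "nbconvert", "--to", fmt, filename] for fmt in ("script", "html")]


def post_save(model, os_path, contents_manager):
    """Jupyter post-save hook: export every saved notebook next to itself."""
    if model["type"] != "notebook":
        return  # ignore plain files and directories
    directory = os.path.dirname(os_path)
    for command in export_commands(os_path):
        # Run inside the notebook's directory so the exports land alongside it.
        subprocess.check_call(command, cwd=directory or None)


# In jupyter_notebook_config.py, where `c` is the config object Jupyter provides:
# c.FileContentsManager.post_save_hook = post_save
```

Committing the generated .py file gives readable diffs in Git, while the .html file preserves rendered outputs for review.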
  • 29. Notebook Best Practices / reproducing Steps: 1. Configure variables 2. Install dependencies 3. List dependencies 4. Define constants 5. Load data 6. Load schema - During instance boot - Within notebook itself (hacky way)
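The "List dependencies" step can be done from inside the notebook with the standard library. This is one possible sketch; the function name `pinned_dependencies` is an assumption, and it requires Python 3.8 or later for `importlib.metadata`.

```python
from importlib import metadata


def pinned_dependencies():
    """Return 'name==version' strings for every installed distribution."""
    pins = {
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    }
    return sorted(pins, key=str.lower)


# Printing the pins in a notebook cell records the environment next to the results.
for pin in pinned_dependencies():
    print(pin)
```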
  • 30. Notebook Best Practices / share & collaboration. [Diagram: two setups. One instance and kernel per user behind a proxy, with notebook storage (SageMaker, JupyterHub); one multiuser instance with several kernels and shared notebook storage (Colab, Zeppelin).]
  • 31. Overview ● ML Process ● Data Pipelines ● Modeling & Training ● Deployment
  • 32. SageMaker – Deployment Pros: - One click - Single row & batch - Traffic split, A/B - Autoscale group Cons: - EC2 instance per model - No contracts - No shadow, replay tests - No model metrics - No model versioning Alternative: - Do it yourself on EKS/ECS - Open source: hydrosphere.io
  • 33. Deployment Best Practice. [Diagram: model v2 [ .... ] served behind a gRPC / HTTP server with a /predict endpoint and declared input and output; runtime JVM, DL4j / TF / Other, on GPU or CPU; a sidecar observes serving requests.] Training data stats: - min, max - range - clusters - quantiles - autoencoder
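The simpler training data stats on this slide (min, max, range, quantiles) can be computed once at training time and then used by a sidecar to screen serving requests. The sketch below assumes numeric features passed as a dict; `profile` and `check_request` are hypothetical names, and clusters and autoencoders are out of scope here.

```python
import statistics


def profile(values):
    """Summarize one numeric training feature: min, max, range and quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    low, high = min(values), max(values)
    return {"min": low, "max": high, "range": high - low,
            "q1": q1, "median": median, "q3": q3}


def check_request(features, profiles):
    """Return names of features that are missing or outside the training range."""
    violations = []
    for name, stats in profiles.items():
        value = features.get(name)
        if value is None or not stats["min"] <= value <= stats["max"]:
            violations.append(name)
    return violations


# Hypothetical feature and values, for illustration only.
profiles = {"age": profile([18, 25, 40, 63])}
print(check_request({"age": 30}, profiles))
print(check_request({"age": 120}, profiles))
```

A sidecar would typically log or count violations rather than reject the request, so the model keeps serving while the anomaly is investigated.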
  • 34. Model Maintenance ● Bad training data ● Bad serving data ● Training/serving data skew ● Misconfiguration ● Performance ● Concept Drift ● Adversarial input ● Upstream error
  • 35. Model Maintenance - Best practices ● Data contracts ● Data Profiling ● Model Performance monitoring ● Concept drift monitoring ● Smart subsampling ● Active retraining and active learning
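One common way to put numbers on training/serving skew is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with the one seen in production. The slides do not name a specific metric, so PSI is a substitute chosen for this sketch. It measures shifts in the inputs only, not concept drift in the strict sense of a changed input-to-label relationship.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training and a serving sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range serving values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = list(range(100))
serving_same = list(range(100))
serving_shifted = [v + 50 for v in train]
print(psi(train, serving_same))
print(psi(train, serving_shifted) > 0.25)
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold should be calibrated per feature.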
  • 36. Online Monitoring / Profiling infrastructure - Learn the latent space online in a stream of production inputs - Compare “online” snapshots with offline - Index encodings for analysis / audit / exploration
  • 38. ML lifecycle orchestration - Jenkins :) - AWS CodePipeline - Step Functions - Kubeflow
  • 40. MLflow - experiment tracking and reproducibility
  • 41. Kubeflow = Argo + Kubernetes configs + ML tutorial
  • 42. Takeaways - A well-architected data platform is a key enabler for a successful ML project - The AWS ML ecosystem is a great foundation for your enterprise-wide ML platform, but you should know your options and the extra layer that has to be built - ML reproducibility and ML lifecycle management are a wild space, with no off-the-shelf solution
  • 43. squadex.com 125 University Avenue Suite 290, Palo Alto California, 94301 Questions, details? We would be happy to answer!