Looking to implement MLOps using AWS services and Kubeflow? Come and learn about machine learning from the experts of Provectus and Amazon Web Services (AWS)!
Businesses recognize that machine learning projects matter, but successful ML goes beyond building and deploying models, which is where most organizations stop. A successful ML project entails a complete lifecycle spanning ML, DevOps, and data engineering, and is built on top of solid ML infrastructure.
AWS and Amazon SageMaker provide a foundation for building machine learning infrastructure, while Kubeflow is a great open source project that does not get enough credit in the AWS community. In this webinar, we show how to design and build an end-to-end ML infrastructure on AWS.
Agenda
- Introductions
- Case Study: GoCheck Kids
- Overview of AWS Infrastructure for Machine Learning
- Provectus ML Infrastructure on AWS
- Experimentation
- MLOps
- Feature Store
Intended Audience
Technology executives & decision makers, manager-level tech roles, data engineers & data scientists, ML practitioners & ML engineers, and developers
Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Qingwei Li, ML Specialist Solutions Architect, AWS
Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!
REQUEST WEBINAR: https://provectus.com/webinar-mlops-and-reproducible-ml-on-aws-with-kubeflow-and-sagemaker-aug-2020/
MLOps and Reproducible ML on AWS with Kubeflow and Amazon SageMaker
1. MLOps and Reproducible ML on AWS
with Kubeflow and Amazon SageMaker
Presented by:
Stepan Pushkarev, CTO @ Provectus
Qingwei Li, ML Specialist Solutions Architect @ AWS
2. 1. Learn how to build a scalable and secure ML infrastructure on AWS with
Provectus
2. Explore best practices of using Amazon SageMaker with open source tools
for better experience and productivity
Webinar Objectives
3. 1. Familiarity with AWS & Amazon SageMaker services
2. Familiarity with ML Workflow
3. Familiarity with Kubeflow & Kubeflow Pipelines
Webinar Prerequisites
4. 1. Introductions
2. Case Study: GoCheck Kids
3. Overview of AWS Infrastructure for Machine Learning
4. Provectus ML Infrastructure on AWS
a. Experimentation
b. MLOps
c. Feature Store
Agenda
5. AI-First Consultancy & Solutions Provider
Clients ranging from
fast-growing startups to
large enterprises
450 employees and
growing
Established in 2010
HQ in Palo Alto
Offices across the US,
Canada, and Europe
We are obsessed with leveraging cloud, data, and AI to reimagine the way
businesses operate, compete, and deliver customer value
6. Innovative Tech Vendors
Seeking niche expertise to
differentiate and win the market
Midsize to Large Enterprises
Seeking to accelerate innovation
and achieve operational excellence
Our Clients
7. Introductions
Stepan Pushkarev
Chief Technology
Officer, Provectus
Iskandar Sitdikov
ML Solutions Architect,
Provectus
Rinat Gareev
ML Solutions Architect,
Provectus
Ilnur Garifullin
ML Solutions Architect,
Provectus
Qingwei Li
ML Specialist Solutions
Architect, AWS
8. The past few years have been like a dream come true for those who work in
analytics and big data. There is a new career path for platform engineers to learn
Hadoop, Scala, and Spark. Java and Python programmers have a chance to move
to the Big Data world, where they find higher salaries, new challenges, and get
to scale up to distributed systems. But recently I have started to hear some
complaints and dashed hopes from engineers who have spent time working there.
9. 1. Tools evolution — The Apache Spark/Hadoop ecosystem is great, but it is not stable and user-friendly enough
to just run and forget. Engineers and data scientists should contribute to existing open source projects and create
new tools to fill the gaps in day-to-day operations.
2. Education and cross skills — When data scientists write code, they need to think not just about abstractions,
but consider the practical issues of what is possible and what is reasonable. For example, they need to think how
long their query will run and whether the data they extract will fit into the storage mechanism they are using.
3. Improve the process — DevOps might be a solution. Here DevOps does not just mean writing Ansible scripts
and installing Jenkins. We need DevOps working in optimal fashion to reduce handoffs and invent new tools to
give everyone self-service to make them as productive as possible.
12. Deep Vision Solution for GoCheck Kids
Business Problem:
● Reduce manual overhead for child vision screening.
● Detect strabismus, crescent, and dark iris/pupil population, and reject images where the child is not looking straight into the camera.
● Security and compliance requirements: track everything, do not touch anything.
Solution:
● End-to-end deep learning image classification models to detect child gaze, strabismus, crescent, and dark iris/pupil population.
13. Modeling Hypothesis
Provectus has developed quite a few ML models, varying:
● Input (pre-processing, region cropping, single vs. two eyes, etc.): 6 variants
● Feature-generation backbone (deep convolutional networks: ResNet, MobileNet, EfficientNet, custom, etc.): 7 variants
● Transfer learning from a synthetic dataset: 3 variants
● Tweaks to the objective function to tackle data imbalance: 5 variants
● Dataset splits: 10 variants
6 × 7 × 3 × 5 × 10 = 6,300 combinations to test in 3 weeks!
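The hypothesis space above multiplies out combinatorially, which is why manual experimentation does not scale. A quick sketch of the grid (the variant counts come from the slide; the option labels are illustrative placeholders, not the actual configurations):

```python
from itertools import product

# Variant counts from the slide; the labels are made up for illustration.
inputs = [f"input_{i}" for i in range(6)]          # pre-processing, cropping, ...
backbones = [f"backbone_{i}" for i in range(7)]    # ResNet, MobileNet, ...
transfer = [f"transfer_{i}" for i in range(3)]     # synthetic-dataset transfer
objectives = [f"objective_{i}" for i in range(5)]  # loss tweaks for imbalance
splits = [f"split_{i}" for i in range(10)]         # dataset splits

grid = list(product(inputs, backbones, transfer, objectives, splits))
print(len(grid))  # 6 * 7 * 3 * 5 * 10 = 6300
```

Even at 15 minutes per run, exhaustively testing this grid would take over 65 GPU-days, so the team prioritized hypotheses and relied on reproducible pipelines instead.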
14. Conducted ~100* experiments on the entire dataset using pipelines within 3 weeks
● 100,000+ images
● Each experiment takes 15 min – 6 hours on a single GPU (P3 instance type)
* not counting development runs and experiments in notebook instances
We always had quite a few pending improvement hypotheses in the backlog
● Each good hypothesis needs several runs to determine the best hyperparameters
● OR an automatic hyperparameter optimizer
Data preparation took ~5 hours
● Had to parallelize and reuse outputs
Each experiment produces artifacts: models, metrics, predictions
Met security and compliance requirements
Benefits and Outcomes of ML Infrastructure
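The slide's point about parallelizing runs and reusing outputs can be sketched with the standard library. This is a toy stand-in, assuming a `run_experiment` function that scores one configuration; it is not the actual training code:

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(config):
    # Stand-in for a real training run; returns a (config, metric) pair.
    return config, sum(config.values())

# A small illustrative grid of experiment configurations.
configs = [{"lr_idx": i, "split": s} for i in range(3) for s in range(4)]

# Run experiments in parallel instead of one by one.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, configs))

best_config, best_metric = max(results, key=lambda r: r[1])
print(best_config, best_metric)
```

In the real infrastructure this role is played by pipeline orchestration, which additionally caches and reuses intermediate artifacts such as the ~5-hour data-preparation output.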
15. Results Summary
● 3X increase in the ML model's recall (at the same precision)
● 95% of the ML engineer's time dedicated to experimentation
● 100+ large-scale experiments in 3 weeks by 3 ML engineers
● 100% secure and FDA compliant
This could not be achieved without Provectus ML Infrastructure on AWS
17. AWS AI/ML Stack
AWS AI Services: vision (Amazon Rekognition), speech (Amazon Polly, Amazon Transcribe +Medical), text (Amazon Comprehend +Medical, Amazon Translate, Amazon Textract), search (Amazon Kendra), chatbots (Amazon Lex), personalization (Amazon Personalize), forecasting (Amazon Forecast), fraud (Amazon Fraud Detector), development (Amazon CodeGuru), and contact centers (Contact Lens for Amazon Connect)
AWS ML Services: Amazon SageMaker and the Amazon SageMaker Studio IDE, including SageMaker Ground Truth, Amazon A2I, SageMaker Neo, built-in algorithms, SageMaker Notebooks, SageMaker Experiments, model tuning, SageMaker Debugger, SageMaker Autopilot, model hosting, and SageMaker Model Monitor
AWS ML Frameworks & Infrastructure: Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Inferentia, and FPGAs
18. Amazon SageMaker: A Fully Managed Service for ML
● Collect and prepare training data
● Select or build ML algorithms
● Set up and manage environments for training
● Train, debug, and tune models
● Manage training runs
● Deploy models in production
● Monitor models
● Scale and manage the production environment
● Validate predictions
32. How Provectus Adds Value
● Feature Store: store and reuse features to build ML models faster
● ML Workflow Orchestrator: reproduce and track the whole ML workflow
● Dataset Management: track and govern training datasets
● Dataset Sampling: sample from production streams
● Advanced Monitoring: detect drift in text & images
● MLOps: continuous training & delivery
33. The Core of MLOps and Reproducible Experimentation
Pipelines
34. 1. Backbone of Experimentation flow
2. Essential part of Continuous Integration and Delivery flow
3. Major part of Continuous Retraining flow
4. Production workload (unlike traditional CI/CD)
5. Part of day-to-day model tuning and development process
6. Idempotent — Should produce the same results with the same inputs
ML Pipeline Characteristics
42. Summary of Kubeflow on AWS
Best Practices:
● Invest in a library of reusable components
● Use Amazon SageMaker Components for Kubeflow
● Deploy on Amazon EKS, consider Provectus Swiss Army
Kube for a quick start
● Use Argo and Kubeflow for MLOps
Benefits:
● Metadata Tracker and Pipeline Orchestrator
● Minimal intervention into existing day-to-day ML routines
44. Value Proposition of Feature Store
A data management layer for machine learning features.
1. Better ROI from feature engineering — Facilitates collaboration,
sharing and reusing of features
2. Increases ML Engineer productivity — Storage is further
decoupled from ML pipelines
3. Prevents training-serving data skew by design
4. Can encapsulate or facilitate data versioning and features
quality monitoring
45. Good News: A properly designed Data Lake
covers 80% of requirements for Feature Store
46. Adding ML Awareness to the Data Lake
Higher-Level Operations:
● Fetch batch (take a sample)
● Get one
● Add / deprecate a feature
Lineage Metadata:
● Upstream models
● Data sources and transformations
Annotation Metadata:
● Agreements
● Judgements
● Annotation job parameters
Data Profiling Metadata:
● Min/max
● Uniqueness, missing values, etc.
Governance Metadata:
● Owner
● Description
● Version
● Last updated, SLA
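The operations and metadata listed above can be sketched as a minimal in-memory feature registry. This is illustrative only: the class, method names, and example feature are invented for this sketch, and a real feature store such as Feast backs these operations with a data lake and an online store:

```python
import random

class FeatureRegistry:
    """Toy feature store: values plus governance metadata per feature."""

    def __init__(self):
        self._features = {}

    def add_feature(self, name, values, owner, description, version=1):
        self._features[name] = {
            "values": values,
            "owner": owner,            # governance metadata
            "description": description,
            "version": version,
            "deprecated": False,
        }

    def deprecate(self, name):
        # "Add / deprecate a feature" from the slide.
        self._features[name]["deprecated"] = True

    def get_one(self, name, index):
        return self._features[name]["values"][index]

    def fetch_batch(self, name, k, seed=0):
        # "Fetch batch (take a sample)" from the slide.
        rng = random.Random(seed)
        return rng.sample(self._features[name]["values"], k)

    def profile(self, name):
        # Data-profiling metadata: min/max, missing values, uniqueness.
        vals = self._features[name]["values"]
        non_null = [v for v in vals if v is not None]
        return {
            "min": min(non_null),
            "max": max(non_null),
            "missing": len(vals) - len(non_null),
            "unique": len(set(non_null)),
        }

store = FeatureRegistry()
store.add_feature("pupil_radius", [3, 5, None, 5, 8],
                  owner="ml-team", description="Pupil radius in px")
print(store.profile("pupil_radius"))
```

A data-lake-backed implementation adds exactly what the slide lists: lineage and annotation metadata per feature, plus versioned, governed storage behind the same interface.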
47. Feature Store: Options
● A general-purpose data catalogue, not a store. Adds a nice UI, governance, and searchability.
● Feast: great design, early stage; nicely overlaps with the Data Lake. No extensive metadata management yet. AWS support: https://github.com/feast-dev/feast/issues/367
● By Ph.D.s for Ph.D.s: a tremendous amount of work and very advanced concepts, but overcomplicated.
● By the creators of Uber's Michelangelo. Closed source.
48. 1. Modern ML infrastructure accelerates time to value for ML initiatives and increases
trust from the business
2. Eliminates handoffs between data scientists, ML engineers, and IT
3. A must-have for both small ML shops and large organizations, spanning from
straightforward image classification projects to more complex ML pipelines
4. A must-have for secure and compliant environments
5. Minimizes growing technical debt in machine learning projects
6. Complements fully managed AWS services with open source projects for pipeline
orchestration, experiment tracking, dataset versioning, and a feature store
Summary of ML Infrastructure
49. 125 University Avenue
Suite 290, Palo Alto
California, 94301
hello@provectus.com
Questions, details?
We would be happy to answer!