GDG Cloud Southlake #3 Charles Adetiloye: Enterprise MLOps in Practice

Enterprise MLOps In Practice
Updated: July 2021

About MavenCode
MavenCode Confidential and Proprietary
MavenCode is an Artificial Intelligence solutions company located in Southlake, Texas. We provide training, product
development, and consulting services, specializing in:
● Provisioning Scalable AI and ML platforms - OnPrem and in the Cloud
● Deployment & Development of Machine Learning Platforms - OnPrem and in the Cloud
● Enterprise Feature Store Development and Management
● Model Management and Governance
● Streaming Data Analytics and Edge IoT Model Deployments
● Document Understanding and Natural Language Processing with Artificial Intelligence

Industry Verticals We Serve
Retail Industry
● Recommendation Engines
● Customer Management
● Demand Analysis and Planning
● Logistics and Supply Management
Insurance Industry
● AI Infrastructure Tooling
● Claims Analysis and Processing
● Document Processing
● Damage Detection and Identification
Automotive Industry
● AI Infrastructure Tooling
● Near Real Time Car Telemetry Analysis
● Preemptive Maintenance Recommendation
Healthcare Industry
● Medical insurance claim analysis
● X-ray image analysis and diagnostics
● Data Driven decision making enablement
Energy Industry
● Capacity Planning and Demand Forecasting
● Preemptive Equipment Maintenance
Travel & Hospitality Industry
● Planning and Logistics
● Customer Recommendations
● Logistics, Planning and Forecasting
Telecom Industry
● Utilization Forecasting
● Churn Rate Analysis
● Preemptive Maintenance of Equipment
Agriculture Industry
● Precision Farming
● Mechanical Utilization Rate and Planning
● Capacity Planning

Let’s Watch this Quick Video

Agenda
1 Overview of Machine Learning Ops
2 MLOps Roles
3 MLOps Landscape
4 Discuss a Use Case
5 Questions and Answers

01
Overview of MLOps

Background of MLOps
As far back as 2014, a group of Google researchers published a paper on this subject...

Interest in MLOps

MLOps is not easy!
Launching a rocket is easy, but the ongoing operation of guiding it
successfully into space afterward is hard

“It took me 3 weeks to develop the model. It’s been > 11 months, and it’s still
not deployed”
@ginablaber
“On average, 40% of companies said it takes more than a month to deploy
ML models into production”
thenewstack.io

What is Machine Learning Operations?
Machine Learning Operations, or MLOps, simplifies the process of deploying machine learning models by
coordinating the work of the operations team with that of the machine learning researchers and data
scientists in the organization

So we can say with MLOps ...
● The goal is to standardize and streamline Machine Learning life cycle management
● It is a critical component of any successful Machine Learning project in the enterprise
● Organizations generate long-term value and mitigate the risk associated with Machine Learning projects

Challenges In Enterprise ML
Reproducibility
● It is not easy to reproduce ML model output on each iterative run
● Constantly changing training data
● Environment configuration consistency issues
Reusability
● Training pipelines are not componentized for reusability
● No well-defined way of doing model versioning and tagging
● Collaboration and sharing of source code is not well defined
Manageability
● Managing model deployment and serving between environments is difficult
● Versioning and tracking model artifacts is very difficult and complex
● No defined way to visually track updates and changes
Automation
● Much of the deployment process is still manual
● Steps needed to update model parameters are not automated
● Most data science teams are not equipped with the right knowledge to take models to production

02
MLOps Roles

What People Think about Machine Learning
Machine Learning Code

Hidden Technical Debt of ML Deployment
Components surrounding the Machine Learning Code: Data Verification, Configuration, Feature Extraction,
Data Validation, Machine Resource Management, Serving Infrastructure, Monitoring, Analysis Tool

MLOps Roles and Responsibilities
ML Architects
● Ensure a scalable and flexible environment for ML model pipelines
● Introduce new technologies that improve ML model performance in production
● Identify bottlenecks in the production system and pinpoint solutions for long-term improvements
Model Risk Managers/Auditors
● Analyze initial business goals and model outcomes
● Minimize overall risk as a result of ML models in production
● Ensure compliance with internal and external requirements before pushing ML models to production
DevOps
● Conduct and build operational systems
● Test systems for security, performance and availability
● CI/CD pipeline management
ML Engineers
● Integrate ML models in the company's applications
● Ensure seamless working of ML models with non-ML based applications
● Maintain functional ML models in production
Data Engineers
● Identify the right data for a project
● Optimize the retrieval and use of data to power ML models
● Resolve underlying issues in data pipelines
Data Scientists
● Build models that address business needs
● Deliver operationalizable models for the production environment
● Assess model quality
Subject Matter Experts
● Provide business questions for framing ML models
● Define business KPIs to be achieved
● Evaluate model performance

ML Team Workflow
1. Develop Models: Business Questions, Data Acquisition, Feature Engineering, Data Preparation, Model Training/Experimentation, Model Evaluation and Comparison
2. Prepare for Production: Runtime Environment, Risk Evaluation, QA, Scalability, Containerization, Continuous Integration
3. Development to Production
4. Monitoring & Feedback: Logging/Alerting, Input Drift Tracking, Online Evaluation, Performance Drift
Roles shown across the stages: Data Scientists, Model Risk Managers/Auditors, Subject Matter Experts, DevOps, Data Engineers, Software Engineers, ML Architects

03
MLOps Landscape

Machine Learning Pipeline
Data sources
● Streaming Source: streaming sources like MQTT, Kafka, Pub/Sub, etc.
● Batch Job Operations: batch operations on databases, file storage, distributed storage, etc.
Pipeline stages: Data Extraction, Data Preparation & Analysis, Data QA and Validation, Feature Engineering, Model Training/Validation, Deployment / Inferencing
Components shown: Model Training, Model Serving, Model Versioning, Prediction Service, Monitoring, Logging, App Integration

Typical ML Engineer or Data Scientist Workflow
Workflow stages: Data Sourcing, Pre-Processing, Feature Engineering, Model Training / Evaluation, Model Scoring / Management, Model Inferencing
Storage: Raw Data Processing and Transformation Pipeline (Raw Data, Transformation, Processed Data) on Azure Storage, Google Storage, or AWS S3 Storage
Compute: Cloud Training Platforms (GCP Vertex, AWS SageMaker, Azure ML) or on-prem KF
Data Scientists and ML Engineers pull and process the data first, before starting ML training on a
managed cloud service

At scale, it gets complex ...
Running ML workflows across the enterprise with multiple teams using different cloud provider technology stacks
Teams: Team A, Team B, Team C, Team D
Platforms: Google Cloud AI, AWS SageMaker, KF on prem, Azure ML
Shared stages: Data Sourcing, Pre-Processing, Feature Engineering
Storage: Raw Data, Transformation, Processed Data on Azure Storage, Google Storage, or AWS S3 Storage
Compute

To simplify the complexity, can we abstract our ML pipeline?
Pipeline stages: Data Sourcing, Pre-Processing, Feature Engineering, Model Training / Evaluation, Model Scoring / Management, Model Inferencing
1. Storage: Feature Store
- Vertex AI Feature Store (Managed Service)
- Feast
- Databricks Feature Store
2. Compute: Kubeflow on Kubernetes, Vertex AI

1. Feature Store In MLOps

What’s Feature Store All About
A feature is a measurable, observable attribute that is part of the input to a Machine Learning model.
Figure: features X1, X2, X3, ..., Xn form the [Feature Vector] used in Model Training to produce the Model.

Features are derived from
● Raw Datastore
● Streaming Datasource
● Aggregates of Raw Inputs
● Windows (mins, hourly, daily, weekly)
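As a minimal sketch of the "aggregates" and "windows" sources listed above, the following groups raw events into hourly windows and derives two features per entity. The event data, the field names, and the `hourly_features` helper are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (customer_id, timestamp, purchase_amount).
raw_events = [
    ("c1", datetime(2021, 7, 1, 9, 15), 20.0),
    ("c1", datetime(2021, 7, 1, 9, 45), 30.0),
    ("c1", datetime(2021, 7, 1, 10, 5), 10.0),
    ("c2", datetime(2021, 7, 1, 9, 30), 5.0),
]


def hourly_features(events):
    """Aggregate raw events into per-entity, per-hour features."""
    buckets = defaultdict(list)
    for entity, ts, amount in events:
        # Truncate the timestamp to the start of its hourly window.
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(entity, window_start)].append(amount)
    return {
        key: {"purchase_count": len(vals), "purchase_sum": sum(vals)}
        for key, vals in buckets.items()
    }


features = hourly_features(raw_events)
for (entity, window_start), values in sorted(features.items()):
    print(entity, window_start.isoformat(), values)
```

The same pattern extends to daily or weekly windows by changing how the timestamp is truncated.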

Features Change Over time!
Figure: the feature values X1, X2, X3, ..., Xn used for Model Training, shown at successive points along the Time axis.

Feature Stores In MLOps
● Makes it easy to operationalize our ML workload, most importantly data management and storage for
model training
● Features can be shared easily among teams running different model training pipelines
● We can version datasets and track changes easily
● Feature input attributes stay consistent between model training and serving
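The versioning and train/serve consistency points above can be illustrated with a toy in-memory store. This is a sketch only: `InMemoryFeatureStore` and its methods are hypothetical and do not mirror the API of Feast, Vertex AI Feature Store, or Databricks Feature Store.

```python
import bisect


class InMemoryFeatureStore:
    """Toy feature store that keeps every timestamped version of an entity's features."""

    def __init__(self):
        # entity_id -> list of (timestamp, feature dict), kept sorted by timestamp
        self._rows = {}

    def write(self, entity_id, timestamp, features):
        rows = self._rows.setdefault(entity_id, [])
        rows.append((timestamp, dict(features)))
        rows.sort(key=lambda row: row[0])

    def read_as_of(self, entity_id, timestamp):
        """Return the latest feature values written at or before `timestamp`."""
        rows = self._rows.get(entity_id, [])
        times = [t for t, _ in rows]
        idx = bisect.bisect_right(times, timestamp)
        return rows[idx - 1][1] if idx else None


store = InMemoryFeatureStore()
store.write("customer-1", 100, {"flights_last_30d": 2})
store.write("customer-1", 200, {"flights_last_30d": 5})

# Training reads the values as they were at time 150; serving reads the latest.
training_row = store.read_as_of("customer-1", 150)
serving_row = store.read_as_of("customer-1", 250)
print(training_row, serving_row)
```

Because both training and serving go through the same read path, they see the same feature definitions, and older versions remain available for reproducing a past training run.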

Getting Data into a Feature Store
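import kfp
from kfp import components

# KafkaDatastreamer, ValidatorOnSchema, PreProcessor, and FeatureStoreWriter are
# Python functions defined elsewhere. Each call below wraps one of them as a
# pipeline component that runs in the given container image.
KafkaDatastreamer_op = kfp.components.create_component_from_func(
    KafkaDatastreamer, base_image="python:3.7.1")
ValidatorOnSchema_op = kfp.components.create_component_from_func(
    ValidatorOnSchema, base_image="python:3.7.1")
PreProcessor_op = kfp.components.create_component_from_func(
    PreProcessor, base_image="python:3.7.1")
# The writer component uses a Spark image.
FeatureStoreWriter_op = kfp.components.create_component_from_func(
    FeatureStoreWriter, base_image="mavencode.io/spark:v3.1.1")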
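The slide does not show the bodies of the four wrapped functions. As a rough sketch of what each stage might do, the version below uses plain Python stand-ins with the same names; the function bodies, the message format, and the dict used as a feature store are assumptions made for illustration, not the original implementation.

```python
import json


def KafkaDatastreamer(messages):
    """Stand-in for a Kafka consumer: decode a batch of JSON messages."""
    return [json.loads(m) for m in messages]


def ValidatorOnSchema(records, schema):
    """Keep only records that have every schema field with the expected type."""
    return [
        r for r in records
        if all(isinstance(r.get(name), typ) for name, typ in schema.items())
    ]


def PreProcessor(records):
    """Example transformation: derive a delay flag from a raw delay field."""
    return [
        {"customer_id": r["customer_id"], "delayed": r["delay_minutes"] > 15}
        for r in records
    ]


def FeatureStoreWriter(rows, store):
    """Write feature rows to a dict standing in for the feature store."""
    for row in rows:
        store[row["customer_id"]] = {"delayed": row["delayed"]}
    return len(rows)


# Hypothetical schema and messages; the second message fails validation.
schema = {"customer_id": str, "delay_minutes": int}
messages = [
    '{"customer_id": "c1", "delay_minutes": 30}',
    '{"customer_id": "c2", "delay_minutes": "unknown"}',
    '{"customer_id": "c3", "delay_minutes": 0}',
]
feature_store = {}
written = FeatureStoreWriter(
    PreProcessor(ValidatorOnSchema(KafkaDatastreamer(messages), schema)),
    feature_store,
)
print(written, feature_store)
```

In a real pipeline each stage would run as its own container and pass data through pipeline inputs and outputs rather than direct function calls.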

2. Kubeflow for MLOps

Why Machine Learning with Kubeflow?
With Kubeflow out of the box on Kubernetes, we can easily have:
● Composability
● Portability
● Scalability

What is Kubeflow
● Machine learning toolkit for Kubernetes.
● Platform to productionize ML models, making them simple, scalable and
reliable.
● Collection of Cloud native tools for all the stages of a model development
life cycle.
● Build integrated end-to-end pipelines which connect all the stages of a
model development life cycle.

Simply Put ...
Kubeflow Simplifies your Model Development Life Cycle (MDLC)

Kubeflow Overview
ML Tools: Chainer, Jupyter, MPI, Scikit-Learn, PyTorch, TensorFlow, MXNet, XGBoost
Kubeflow Applications: Jupyter Notebook, Chainer Operator, MPI Operator, MXNet Operator, PyTorch Operator, TFJob Operator, XGBoost Operator, Hyperparameter Tuning (Katib), Fairing, Metadata, Pipelines, Kubeflow UI, KFServing, TensorFlow Batch Prediction, PyTorch Serving, TensorFlow Serving, Seldon Core Serving, Knative Serving, Istio, Argo, Prometheus
Kubernetes

Enterprise Machine Learning with Kubeflow
MLOps Training and Deployment Platform
● Authentication and Authorization: in-cluster traffic control by Istio, RBAC, and UI access with an SSO identity-compatible proxy
● Data Science Team: Data Scientist 1, Data Scientist 2, Data Scientist 3
● Namespaces, each with a Kubeflow Jupyter Notebook: Bob, Dav, Chuck, Team
● Kubeflow Managed Model Infrastructure: Auto-Scalable CPU Node Pool, Auto-Scalable GPU Node Pool

Vertex AI
https://codelabs.developers.google.com/vertex-pipelines-intro#6

04
Let’s go through a Scenario

Airline Customer Prediction
● The Dataset is from Kaggle.
● The data comes from an airline whose actual name is not given for various reasons, so the airline
is referred to by the pseudonym Invistico Airlines.
● The dataset (23 columns and 129,880 entries) contains details of customers who have already
flown with the airline.
Data Scientists
Subject Matter
Experts

Problem Statement
Customer satisfaction is a priority in the airline industry.
Unhappy or disengaged customers naturally mean fewer passengers and less revenue.
As satisfaction is rarely solely about the flight itself but also the experience from booking to landing, this scenario is aimed
at building a machine learning model using all salient features in the data to predict customer satisfaction.

Data Analysis
Data Scientists
Subject Matter
Experts

Exploratory Data Analysis
● Customers in business class seats were the most satisfied.
● The dataset showed more satisfied customers than dissatisfied ones, with 54.7% of the surveyed
customers reporting satisfaction with their experiences.
● There were more female travelers than male travelers, and more females reported satisfaction with
their experiences.
● Most customers travelled for business purposes, and satisfaction was higher among business travelers.

Heatmap showing Feature
Correlation
Data Scientists
Subject Matter
Experts

Feature Engineering
Data Scientists
Data Engineers

Feature Engineering
To make the data fit for our machine learning model, we performed the
following feature engineering steps:
1. Removing outliers
2. Dropping rows with null values
3. Dropping and combining columns with little or no correlation with our
target variable
4. Converting Categorical features to numbers
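The four steps can be sketched in plain Python as follows. The rows and column names are hypothetical rather than the actual Kaggle schema, and the interquartile-range rule is an assumed outlier criterion, since the slides do not say which method was used. Nulls are dropped first here so that the quartiles can be computed.

```python
import statistics

# Hypothetical rows; "gate" stands in for a column with no predictive value.
rows = [
    {"travel_class": "Business", "gate": "A1", "delay": 5},
    {"travel_class": "Eco", "gate": "B2", "delay": 12},
    {"travel_class": "Eco", "gate": "C3", "delay": None},
    {"travel_class": "Business", "gate": "A4", "delay": 8},
    {"travel_class": "Eco", "gate": "B5", "delay": 10},
    {"travel_class": "Eco", "gate": "C6", "delay": 7},
    {"travel_class": "Business", "gate": "A7", "delay": 9},
    {"travel_class": "Eco", "gate": "B8", "delay": 11},
    {"travel_class": "Eco", "gate": "C9", "delay": 900},
]


def drop_nulls(data):
    """Drop rows that contain a null value in any column."""
    return [r for r in data if all(v is not None for v in r.values())]


def remove_outliers(data, column, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR] for `column` (assumed rule)."""
    q1, _, q3 = statistics.quantiles([r[column] for r in data], n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in data if low <= r[column] <= high]


def drop_columns(data, columns):
    """Remove columns judged to have little or no correlation with the target."""
    return [{k: v for k, v in r.items() if k not in columns} for r in data]


def encode_categorical(data, column):
    """Replace each category with an integer code; also return the mapping."""
    codes = {v: i for i, v in enumerate(sorted({r[column] for r in data}))}
    return [{**r, column: codes[r[column]]} for r in data], codes


no_nulls = drop_nulls(rows)
no_outliers = remove_outliers(no_nulls, "delay")
trimmed = drop_columns(no_outliers, ["gate"])
clean, class_codes = encode_categorical(trimmed, "travel_class")
print(len(rows), len(clean), class_codes)
```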
Data Scientists
Data Engineers

Before Outlier Removal After Outlier Removal
Feature Engineering: Outlier Removal

Feature Engineering Data Pipeline
● Load data: reads data from source.
● Dataset Statistics: displays summary statistics of the data.
● Dataset Schema: automatically generates a schema by
inferring types, categories, and ranges from the data.
● Dataset Validation: uses the inferred schema to detect
anomalies in the data.
● Feature Engineering: performs necessary preprocessing
and feature engineering steps on the dataset.
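The schema and validation steps above can be sketched as follows. The slides do not name the library used, so this is a plain-Python illustration with hypothetical `infer_schema` and `validate` helpers and made-up records.

```python
def infer_schema(records):
    """Infer a type plus a numeric range or category set for each column."""
    schema = {}
    for column in records[0]:
        values = [r[column] for r in records]
        if all(isinstance(v, (int, float)) for v in values):
            schema[column] = {"type": "numeric", "min": min(values), "max": max(values)}
        else:
            schema[column] = {"type": "categorical", "values": set(values)}
    return schema


def validate(records, schema):
    """Return a list of (row_index, column, problem) anomalies."""
    anomalies = []
    for i, record in enumerate(records):
        for column, spec in schema.items():
            if column not in record:
                anomalies.append((i, column, "missing column"))
                continue
            value = record[column]
            if spec["type"] == "numeric":
                if not isinstance(value, (int, float)):
                    anomalies.append((i, column, "not numeric"))
                elif not spec["min"] <= value <= spec["max"]:
                    anomalies.append((i, column, "out of range"))
            elif value not in spec["values"]:
                anomalies.append((i, column, "unseen category"))
    return anomalies


# Hypothetical data: the schema is inferred from training data, then applied to new data.
training = [{"age": 25, "class": "Eco"}, {"age": 60, "class": "Business"}]
incoming = [{"age": 40, "class": "Eco"}, {"age": 130, "class": "First"}]

schema = infer_schema(training)
anomalies = validate(incoming, schema)
print(anomalies)
```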

Model Training with ML Operators on
Kubeflow

ML Operators - Overview
● An ML operator helps to deploy, monitor and manage the lifecycle of a training job.
● Kubeflow operators include:
○ tf-operator
○ pytorch-operator
○ xgboost-operator
○ mpi-operator, and many more, which can be found on the official Kubeflow account.

Model Training with Tensorflow Operator
● TensorFlow Operator is one of the operators offered by Kubeflow to make it easy to run and
monitor both distributed and non-distributed TensorFlow jobs on Kubernetes.
● Training TensorFlow models using tf-operator relies on centralized parameter servers for
coordination between workers. It supports the TensorFlow framework only.
● After preprocessing our data, we built a TensorFlow neural network model.
● Our TensorFlow model had an accuracy of approximately 88%.
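To make the parameter-server and worker layout concrete, here is a sketch that builds a TFJob manifest as a Python dictionary, assuming the `kubeflow.org/v1` TFJob resource layout. The job name, the container image, and the replica counts are placeholders, not values from the talk.

```python
import json


def tfjob_manifest(name, image, workers=2, parameter_servers=1):
    """Build a TFJob manifest with parameter server and worker replicas."""

    def replica(count):
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    # The training container in a TFJob is named "tensorflow".
                    "containers": [{"name": "tensorflow", "image": image}]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica(parameter_servers),
                "Worker": replica(workers),
            }
        },
    }


# Placeholder name and image, for illustration only.
manifest = tfjob_manifest("airline-satisfaction-train", "example.com/airline-train:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied to a cluster that has the training operator installed.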

Hyperparameter Tuning
Model Risk
Managers/Auditors
ML Engineers
Data Scientists

What is Hyperparameter Tuning?
Hyperparameters: configuration values that are external to the model and are always set before the
model training process begins
Selecting the right hyperparameters can significantly improve model performance in production
Hyperparameter Tuning: finding the hyperparameter input values that optimize the objective
function of the model training
(a1, b1, c1,.....zN)
(a2, b2, c2,.....zN)
(a3, b3, c3,.....zN)

What is Hyperparameter Tuning?
ml.trainModel(layers=10, batch=20, learning_rate=0.2)
Figure: each trial pairs a set of hyperparameters (for example layers=13, batch=12, learning_rate=0.2) with the parameters learned through weight optimization and a resulting score. The five trials shown score 85, 89, 94, 91, and 81.

Hyperparameter Tuning is Hard!
● Manual tuning by hand is very inefficient, error-prone and difficult to track
● Capturing metrics across multiple jobs and comparing them is difficult!
● Efficiently allocating resources and infrastructure on the cluster to handle all the job runs is not an
easy task
● As more hyperparameters are added, the combinatorial search space of possible inputs to maximize the
training objective function grows exponentially!
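The growth of the search space is easy to check with a little arithmetic. In this sketch the candidate values are made up, with four options per hyperparameter, so every added hyperparameter multiplies the number of grid combinations by four.

```python
import itertools
import math

# Hypothetical candidate values, four per hyperparameter.
search_space = {
    "layers": [2, 4, 8, 16],
    "batch": [16, 32, 64, 128],
    "learning_rate": [0.001, 0.01, 0.1, 0.2],
}


def grid_size(space):
    """Number of combinations in a full grid over the given search space."""
    return math.prod(len(values) for values in space.values())


# Grid size when tuning the first 1, 2, and 3 hyperparameters.
sizes = []
for n in range(1, len(search_space) + 1):
    subset = dict(itertools.islice(search_space.items(), n))
    sizes.append(grid_size(subset))

all_trials = list(itertools.product(*search_space.values()))
print(sizes, len(all_trials))
```

With ten hyperparameters at four values each, the same grid would already contain 4^10 = 1,048,576 combinations.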

Hyperparameter Tuning with Katib on Kubeflow
Katib is the Hyperparameter tuning component of Kubeflow
It is language and framework agnostic
- TensorFlow
- PyTorch
- MXNet
- XGBoost
Customizable hyperparameter search algorithms
- Random Search
- Grid search
- Bayesian Optimization
- Hyperband
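To show how two of the listed strategies differ, the sketch below runs grid search and random search against a toy scoring function. It does not use Katib's API; `toy_score`, the candidate values, and the ranges are all assumptions standing in for a real training run.

```python
import itertools
import random


def toy_score(layers, learning_rate):
    """Stand-in for a training run; peaks at layers=8, learning_rate=0.1."""
    return 1.0 - abs(layers - 8) / 16 - abs(learning_rate - 0.1)


def grid_search(layer_options, lr_options):
    """Evaluate every combination of the listed values and keep the best."""
    candidates = itertools.product(layer_options, lr_options)
    return max(candidates, key=lambda c: toy_score(*c))


def random_search(layer_range, lr_range, trials, seed=0):
    """Sample `trials` random points from the ranges and keep the best."""
    rng = random.Random(seed)
    candidates = [
        (rng.randint(*layer_range), rng.uniform(*lr_range)) for _ in range(trials)
    ]
    return max(candidates, key=lambda c: toy_score(*c))


best_grid = grid_search([2, 4, 8, 16], [0.01, 0.1, 0.2])
best_random = random_search((2, 16), (0.01, 0.2), trials=20)
print(best_grid, best_random)
```

Grid search only ever tries the listed values, while random search can land on values between them, which is one reason it is often preferred when only a few hyperparameters matter.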

Katib Concepts
1. Experiment: An experiment is a single tuning run, also called an optimization run. You specify configuration
settings to define the experiment. The following are the main configurations:
● Objective: What you intend to optimize. This is the objective metric, also called the target variable.
● Search Space: The set of all possible hyperparameter values that the hyperparameter tuning job
should consider for optimization, and the constraints for each hyperparameter.
● Search Algorithm: The algorithm to use when searching for the optimal hyperparameter values.
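The three configurations map onto fields of a Katib Experiment resource. The sketch below builds one as a Python dictionary, assuming the `kubeflow.org/v1beta1` Experiment layout; the names, goal, trial counts, and parameter ranges are placeholders, and the trial template that defines the training job is left out.

```python
import json

experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    # Placeholder name and namespace.
    "metadata": {"name": "airline-satisfaction-tuning", "namespace": "kubeflow"},
    "spec": {
        # Objective: the metric to optimize and the direction.
        "objective": {
            "type": "maximize",
            "goal": 0.95,
            "objectiveMetricName": "accuracy",
        },
        # Search algorithm.
        "algorithm": {"algorithmName": "random"},
        "parallelTrialCount": 3,
        "maxTrialCount": 12,
        "maxFailedTrialCount": 3,
        # Search space: one entry per hyperparameter with its constraints.
        "parameters": [
            {
                "name": "learning_rate",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.001", "max": "0.2"},
            },
            {
                "name": "layers",
                "parameterType": "int",
                "feasibleSpace": {"min": "2", "max": "16"},
            },
        ],
        # trialTemplate (the training job run for each trial) is omitted here.
    },
}

print(json.dumps(experiment, indent=2))
```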

Hyperparameter Tuning with Katib
Katib automates the Hyperparameter Tuning
process by running a pre-configured number of
training jobs (known as trials) in parallel.

Result of Katib Experiment
With Katib hyperparameter tuning, accuracy increased from 88% to 92.1%

Model Serving with KFServing
● KFServing is Kubeflow's model deployment
and serving toolkit
● To efficiently serve our model using
KFServing, we built a Kubeflow pipeline to
load data, preprocess, train the model, make
predictions, export and serve the model.
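As a sketch of the serving step, the following builds a KFServing InferenceService manifest and a prediction request, assuming the `serving.kubeflow.org/v1beta1` resource layout and the V1 predict protocol used by TensorFlow models. The service name, storage URI, and input values are placeholders rather than details from the talk.

```python
import json


def inference_service(name, storage_uri):
    """Build an InferenceService manifest for a TensorFlow SavedModel."""
    return {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {"predictor": {"tensorflow": {"storageUri": storage_uri}}},
    }


def predict_request(name, instances):
    """Build the URL path and JSON body for a V1 predict call."""
    return f"/v1/models/{name}:predict", json.dumps({"instances": instances})


# Placeholder model location and a single made-up feature vector.
service = inference_service("airline-satisfaction", "gs://example-bucket/models/airline")
path, body = predict_request("airline-satisfaction", [[0.0, 1.0, 35.0, 2.0]])

print(json.dumps(service, indent=2))
print(path, body)
```

The path and body would be sent with an HTTP POST to the host that the cluster assigns to the service.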

Enterprise ML Operationalization Goal

End to End ML Operationalization Process

Model Development Life Cycle (Data Scientist View)
Data Information Knowledge Insight
Data Scientist workflow essentially follows this path ...

Machine Learning Development Life Cycle (Production Deployment)
Figure: Model Training, Training Data, ETL, Tuning, Inferencing, Serving, Monitoring, Update
