End to End AI Workflows
Skymind
Skymind Overview
Schedule
● 9:30-10:00: Doors Open
● 10:00-11:20: Workflow Overview
● 11:20-11:30: Break
● 11:30-12:00: Hands-on walkthrough of training models
● 12:00-13:00: Lunch
● 13:00-13:50: Deployment Scenarios/Considerations
● 13:50-14:00: Break
● 14:00-14:50: Serving models hands-on
● 15:00-15:50: Overview of different ways of serving models
● 16:00-16:15: Wrap-up
Workflow Overview
● Organize and define issues and problems
● Organize the expected effects (benefits)
● Feasibility Study
● Experimental design
● Data collection
● Data organization / analysis / pre-processing
● Build a baseline model
● Revisit preprocessing and improve the model
● Tuning
● Evaluation
● Environment construction & deployment
● Post-deployment monitoring and model relearning
※ We will provide training that covers all of these in
detail.
| Challenges of ML in Enterprise
● Different teams have different technology preferences
● A framework that is good for deep learning experimentation does not necessarily make a good production framework
● Data scientists are great at experimentation, but less so at implementing production-ready models and the DevOps work that requires
● The engineering/DevOps team ends up rewriting models for the production environment

Data Science Team
● Write deep learning workflows and models in notebooks or other development IDEs

Engineering/DevOps Team
● Refactor or rewrite models for production environments
● Automate training and optimization jobs
● Deploy models
Tools Overview
| Ecosystem
● ML/DL frameworks: Keras, TensorFlow, Deeplearning4J, Scikit-Learn, and others
● Model development workflows: feature extraction, model import, model training, hyperparameter optimization, ...
● Runtime: retraining, model performance monitoring, model versioning, job scheduling, ...
Managing AI models over time
● More than a REST API
● Model calibration and input outlier detection
● Monitor inputs and adapt to changes in evolving data
[Diagram: the learning loop in production, where AI model decisions feed performance monitoring, A/B testing, and human-in-the-loop feedback, which in turn drive retraining and hot swaps of models]
Scoping a Project
● Organize and define issues and problems
● Organize the expected effects (benefits)
● Feasibility Study
● Experimental design
| Feasibility study, experimental design
● It usually takes only a few weeks, so it is a quick step.
● Is someone already doing similar tasks somewhere?
● How much of the problem can realistically be solved?
● What data do you prepare, what techniques do you use, and how do you evaluate the results?
● What team configuration do you need, and what skills are required?
● What kind of infrastructure do you need?
| Data
● Data Collection
● Data organization / analysis / pre-processing
● EDA
● Data Quality Assessment
| ETL/Data Collection
● Understand your data sources
● Do you have enough of the right kind of data?
● Can you access the data? (regulations)
● What is your vectorization pipeline? (see the sketch below)
● What data volumes do you expect to accumulate over time? Per day? Per month?
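To make the vectorization-pipeline question concrete, here is a minimal sketch using pandas and scikit-learn; the CSV path, column names, and feature types are hypothetical placeholders, not part of any specific project.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("transactions.csv")          # hypothetical raw data source
    numeric_cols = ["amount", "age"]              # hypothetical numeric features
    categorical_cols = ["country", "channel"]     # hypothetical categorical features

    vectorizer = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),                            # scale numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one-hot encode categories
    ])

    X = vectorizer.fit_transform(df)              # feature matrix fed to model training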
| Modeling
● Build Baseline Model
● Refine and Tune
● Try Different Architectures after baseline
● Rinse and repeat as necessary, evaluating your model against at minimum a train/test/validation split (a baseline sketch follows below)
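A minimal baseline sketch, assuming the feature matrix X and labels y produced by the vectorization step above; the 60/20/20 split and the logistic regression are just illustrative choices, not a recommendation for any particular problem.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # 60/20/20 train/validation/test split (illustrative proportions)
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, baseline.predict(X_val)))
    # Keep the test set untouched until tuning is finished.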
| About model evaluation
● Is it okay to always measure with the same accuracy metric?
○ A diagnostic model in the medical field vs automated driving vs customer demand forecasting, etc.
● It is necessary to change the evaluation method according to the problem to be solved
○ How the data is split
○ How the data is distributed
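A small illustration with made-up predictions: a screening-style classifier is judged primarily on recall and precision, while a demand forecast is judged by an error measure such as MAE. The numbers below are toy values, not results.

    from sklearn.metrics import mean_absolute_error, precision_score, recall_score

    # Screening-style classification: missing a positive case is costly, so watch recall.
    y_true_cls = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred_cls = [1, 0, 0, 1, 0, 0, 1, 0]
    print("recall:   ", recall_score(y_true_cls, y_pred_cls))     # 3 of 4 positives found
    print("precision:", precision_score(y_true_cls, y_pred_cls))  # no false positives in this toy case

    # Demand forecasting: an error measure is more informative than accuracy.
    y_true_reg = [120, 98, 143, 110]
    y_pred_reg = [115, 105, 150, 100]
    print("MAE:      ", mean_absolute_error(y_true_reg, y_pred_reg))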
| Difference between learning and production use (Inference)
● In a production environment, ML models are generally not updated
● Model parameters are updated only during training
● In a production environment, the AI model is frozen
○ Load the trained model from disk in advance and keep it in memory so it can respond to requests at any time
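A minimal sketch of that pattern, assuming a Keras model saved to a hypothetical path: the model is loaded from disk once at startup and then stays frozen in memory for serving.

    import numpy as np
    from tensorflow import keras

    model = keras.models.load_model("models/churn_model.h5")  # hypothetical path; loaded once at startup

    def predict(features):
        """Score a single request against the frozen in-memory model."""
        x = np.asarray(features, dtype="float32").reshape(1, -1)
        return model.predict(x)[0]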
| Evaluation Data
● Always understand your training data
● What kind of evaluation data do you need?
● Data actually used by the company (production data)
● How much do you need?
● The more the better
● Pay particular attention to seasonality when collecting evaluation data
| Preparation of Evaluation Data
● What is generalization performance?
● Why is generalization performance important?
● How do you measure it? (see the sketch below)
[Diagram: data split into Train / Dev / Test sets]
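One way to measure it, as a sketch that reuses the baseline model and splits from the modeling sketch above: compare performance on the training set against the dev set; a large gap points to poor generalization (overfitting).

    from sklearn.metrics import accuracy_score

    train_acc = accuracy_score(y_train, baseline.predict(X_train))
    dev_acc = accuracy_score(y_val, baseline.predict(X_val))
    print(f"train accuracy:     {train_acc:.3f}")
    print(f"dev accuracy:       {dev_acc:.3f}")
    print(f"generalization gap: {train_acc - dev_acc:.3f}")  # a large gap suggests overfitting
    # The test set is evaluated only once, at the very end.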
| Deployment
● Environment Construction and Setup
● Intended usage scenarios (Batch, Real time,..)
● Resource usage understanding/provisioning
● Post-deployment monitoring and model relearning
About deployment
[Diagram: model learning (training) produces a model that is deployed to an AI server; on disk the model consists of a definition (config) file and a weight file]
※ Weight files range from tens of megabytes for small models to close to a gigabyte for large ones.
※ For real-time processing, the model must be deployed into memory in advance so that it can process requests without making callers wait.
※ A cloud API works the same way in principle.
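As an illustration of the definition-file / weight-file split, here is a minimal sketch with the Keras API and hypothetical file names; other frameworks separate architecture and parameters in a similar way.

    from tensorflow import keras

    # Training side: persist architecture and weights as separate files.
    with open("model_definition.json", "w") as f:
        f.write(model.to_json())                  # definition (config) file
    model.save_weights("model.weights.h5")        # weight file

    # Serving side: rebuild the model and load the weights into memory.
    with open("model_definition.json") as f:
        serving_model = keras.models.model_from_json(f.read())
    serving_model.load_weights("model.weights.h5")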
| About AI execution environment
● Local machine
● Embedded (Raspberry Pi, phone, ...)
● Python script
● On-prem server as part of an application
● AI workflow platform (SKIL, Michelangelo, FBLearner, SageMaker, ...)
Model Deployment & maintenance
What does it mean to deploy a learned model?
Where does the need for model maintenance come from?
This section introduces the final step that remains after you have obtained a good model.
| Monitoring After Deployment
● Why do we need monitoring?
● Concept Drift
● Example:
○ Marketing based on customers' online shopping behavior
○ Fraud detection
○ Data with seasonality
※ As a premise, understand that data is ever-changing
Human-in-the-loop thinking
Assume from the start that an AI model will never be 100% accurate, and work on solving the problem with that premise.
Even if accuracy is not 100%, the desired effect can still be achieved.
Human in the loop is one approach that makes this possible.
| Human-in-the-loop concept
● Instead of aiming for 100% accuracy (generally not possible)
● Have a recovery plan for when models fail. Prefer feedback.
[Diagram: input → AI model → decision; if the decision is right, accept it; if wrong, send feedback back into the loop]
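A rough sketch of such a routing rule, assuming the model returns a reasonably calibrated probability; the threshold and the review/feedback helpers below are hypothetical.

    CONFIDENCE_THRESHOLD = 0.9        # hypothetical; tune per use case and risk tolerance

    def decide(features):
        # `score` is assumed to be a calibrated probability from the deployed model.
        score = float(model.predict(features.reshape(1, -1))[0])
        if score >= CONFIDENCE_THRESHOLD or score <= 1 - CONFIDENCE_THRESHOLD:
            return "auto", int(score >= 0.5)   # confident either way: accept the model's decision
        label = ask_human_reviewer(features)   # hypothetical review queue / UI
        record_feedback(features, label)       # hypothetical store; feeds later retraining
        return "human", label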
| Calibrate your Models
● Requirements to realize Human in the loop:
● Many recent Deep Learning models are not correctly calibrated.
● Model predictions need properly calibrated confidence scores
● Calibration is mandatory for deciding whether to intervene in the workflow
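As one example of calibration (scikit-learn's CalibratedClassifierCV; not necessarily what a given platform uses internally), reusing the training data from the earlier sketches:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.ensemble import GradientBoostingClassifier

    # Wrap an uncalibrated classifier so its predicted probabilities better match
    # observed frequencies; isotonic regression is fitted via internal cross-validation.
    calibrated_clf = CalibratedClassifierCV(
        GradientBoostingClassifier(), method="isotonic", cv=3
    )
    calibrated_clf.fit(X_train, y_train)

    proba = calibrated_clf.predict_proba(X_val)[:, 1]   # calibrated confidence scores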
Model maintenance
| A bit more on Concept Drift
A phenomenon in which the characteristics of the data keep changing over time, so that the data used for training and the latest data no longer match.
Examples of data with seasonality
● Customer behavior in an online shop
○ Affects online advertising and promotion decisions.
● Product design document data
○ The format of the design documents, the contents of the descriptions, etc. keep changing, which affects reading accuracy.
| Countermeasures for Concept Drift
● In the day-to-day operation, store feedback from the operators.
● Perform continuous / periodic model performance checks and relearning
with newly stored data.
● Continuously monitor a key metric, such as the frequency of feedback
● Sometimes batch retraining or tuning the model might be better
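One simple periodic check, sketched here with a two-sample Kolmogorov-Smirnov test; the feature name, the production-data loader, and the alert threshold are all hypothetical.

    import numpy as np
    from scipy.stats import ks_2samp

    train_amounts = np.asarray(df["amount"])            # distribution at training time (hypothetical column)
    recent_amounts = load_recent_production("amount")   # hypothetical: e.g. last week of live inputs

    stat, p_value = ks_2samp(train_amounts, recent_amounts)
    if p_value < 0.01:                                  # alert threshold is a judgment call
        print("Possible drift in 'amount': investigate and consider retraining.")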
Deployment Scenarios
| Different Types of Hosts
Host primitives:
● Bare metal
● Virtual machine (VM)
● Container
Management layers:
● Configuration manager / automation engine (example: Ansible): an automation platform that performs configuration management, software provisioning, and application deployment
● Hypervisor (examples: Xen, KVM): virtualization infrastructure that creates, manages, and runs one or more virtual machines on a host machine
● Orchestrator (examples: Kubernetes, Marathon): an orchestration platform that manages, provisions, and scales containerized applications
| Server Virtualization
● A server is comprised of a set of hosts.
● Each server can be a combination of bare metal servers, virtual machines, and containers.
● Each host has a cluster membership indicating which cluster it belongs to.
| Configuration
[Diagram: single server, multiple servers, and hybrid cloud configurations]
● High Availability Using ZooKeeper
● Distributed Training and Inference on Multi-GPUs via Spark
● Integration with JVM ecosystem (Hadoop)
● Many SKIL server components are also embeddable in Java applications
| Single Node
Architecture
● Simple architecture for getting started
● Cost-effective solution for customers with less than 100 GB of data
● Perfect for a DGX-1 type system
[Diagram: a single host (CPU, GPU, OS) running ZooKeeper, SKIL training workspaces, SKIL deployments, data exploration/training, and SKIL data connectors]
| Multi-Node Training Cluster (Scaled-Out Training Cluster)
Architecture
● Any midrange VM or dedicated machine for ZooKeeper
● One or more multi-GPU systems (DGX class or similar) for SKIL
● Gluster/HDFS provides a global file system for data
| HYBRID CLOUD
Hybrid Cloud Cluster (Cloud Data Storage)
Architecture
● DGX-1 servers for SKIL with 8 P100/V100 GPUs
● An existing Hadoop cluster is used by SKIL for
○ ETL (preparing data for training on GPU), or
○ batch inference for distributed scoring with trained models
[Diagram: cloud data storage such as Amazon S3 or Azure Blob alongside the SKIL cluster]
| HYBRID CLOUD
Hybrid Cloud Cluster (Cloud Server)
Architecture
● DGX-1 servers for SKIL with 8 P100/V100 GPUs
● Private/public clouds such as AWS EC2 or Azure VMs to serve models
[Diagram: models served on cloud servers such as AWS EC2 and GCP]
| MULTI-CLUSTER
GPU Training Cluster + CPU Inference Cluster
Architecture
● Powerful GPU servers or a Spark cluster for training models
● Separate (multiple) deployment-only clusters for production deployments of ML models as REST APIs
| Edge Deployment (Edge Inference Cluster)
Architecture
● Inference runs on edge devices
● (Optional) Retraining on a powerful on-premise server or cloud cluster
○ MHS tracks the performance metrics to prompt retraining
[Diagram: SKIL instances on IoT devices, backed by SKIL on an on-premise cluster and a public/private cloud]
| Edge Deployment (Edge Deployment Cluster: Training and Inference)
Architecture
● SKIL deployed on edge devices (embedded systems) for
○ training of lightweight models
○ retraining of models / transfer learning
● Alternatively, models can be trained on a central server; agents on the end devices swap models in and out of the devices
● A central SKIL server navigates and coordinates between the edge devices
[Diagram: a central SKIL server coordinating SKIL instances on embedded systems such as a robot, a tablet, and a car]
Various deployment methods

Local machine (execution on the RPA robot execution machine)
・ Low introduction cost
・ No need for additional infrastructure investment
・ The period until introduction is short
・ Flexible response to individual cases
・ Flexibly customizable
・ Individual environment construction is troublesome
・ Scalability is low
・ High maintenance cost
・ Machine specification is low

On-prem server
・ Flexible response to individual cases
・ Flexibly customizable
・ Scalability is high
・ Machine spec is high
・ Infrastructure investment is required
・ Individual environment construction is relatively easy
・ High maintenance cost

Cloud API
・ Low introduction cost
・ The period until introduction is short
・ No need for model preparation
・ Environment construction is easy
・ Resolvable issues are limited
・ Customization is difficult
・ Pay-per-use
| Other considerations
● Precision/memory trade-off for models
● Model compression for large models and constrained environments
● Hardware support (TPUs do not support all ops, certain things only run on ARM/Intel, ...)
● Model quantization is useful; such optimizations are becoming more common (see the sketch below)
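As one example of quantization, a minimal post-training quantization sketch with TensorFlow Lite, assuming `model` is an already trained Keras model; other frameworks and hardware targets have comparable tooling.

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)   # `model`: trained Keras model
    converter.optimizations = [tf.lite.Optimize.DEFAULT]          # enable default post-training quantization
    tflite_model = converter.convert()

    with open("model_quantized.tflite", "wb") as f:
        f.write(tflite_model)                                     # much smaller artifact for edge/mobile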
