End to End AI Workflows
Skymind
Skymind Overview
Schedule
● 9:30-10:00: Doors Open
● 10:00-11:20: Workflow Overview
● 11:20-11:30: Break
● 11:30-12:00: Hands-on walkthrough of training models
● 12:00-13:00: Lunch
● 13:00-13:50: Deployment Scenarios/Considerations
● 13:50-14:00: Break
● 14:00-14:50: Serving models hands-on
● 15:00-15:50: Overview of different ways of serving models
● 16:00-16:15: Wrap-up
Workflow Overview
● Organize and define issues and problems
● Organize the expected effects (benefits)
● Feasibility Study
● Experimental design
● Data collection
● Data organization / analysis / pre-processing
● Build a baseline model
● Revisit preprocessing and improve the model
● Tuning
● Evaluation
● Environment construction & deployment
● Post-deployment monitoring and model relearning
※ We will provide training that covers all of these in
detail.
| Challenges of ML in Enterprise
● Different teams have different technology preferences
● A framework that is good for deep learning experimentation does not necessarily make a good production framework
● Data scientists are great at experimentation, but less so at implementing production-ready models and the DevOps work that requires
● The engineering/DevOps team ends up rewriting models for the production environment

Data Science Team
● Write deep learning workflows and models in notebooks or other development IDEs

Engineering/DevOps Team
● Refactor or rewrite models for production environments
● Automate training and optimization jobs
● Deploy models
Tools Overview
| Ecosystem
● ML/DL frameworks: Keras, TensorFlow, Deeplearning4J, Scikit-Learn, and others
● Model development workflows: feature extraction, model import, model training, hyperparameter optimization, ...
● Runtime: retraining, model performance monitoring, model versioning, job scheduling, ...
Managing AI models over time
● More than a REST API
● Model calibration and input outlier detection
● Monitor inputs and adapt to changes in evolving data
[Diagram: the learning loop in production, where AI model decisions feed performance monitoring, A/B testing, and human-in-the-loop feedback, which in turn drive retraining and hot swaps of models]
Scoping a Project
● Organize and define issues and problems
● Organize the expected effects (benefits)
● Feasibility Study
● Experimental design
| Feasibility study, experimental design
● It usually takes only a few weeks, so it is a quick step.
● Is someone already doing similar tasks somewhere?
● How much of the problem can realistically be solved?
● What data do you prepare, what techniques do you use, and how do you evaluate the results?
● What team configuration do you need, and what skills are required?
● What kind of infrastructure do you need?
| Data
● Data Collection
● Data organization / analysis / pre-processing
● EDA
● Data Quality Assessment
| ETL/Data Collection
● Understand your data sources
● Do you have enough of the right kind of data?
● Can you access the data? (regulations)
● What is your vectorization pipeline? (see the sketch below)
● What data volumes do you expect to accumulate over time? Per day? Per month?
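To make the vectorization-pipeline question concrete, here is a minimal sketch using pandas and scikit-learn; the CSV path, column names, and feature types are hypothetical placeholders, not part of any specific project.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("transactions.csv")          # hypothetical raw data source
    numeric_cols = ["amount", "age"]              # hypothetical numeric features
    categorical_cols = ["country", "channel"]     # hypothetical categorical features

    vectorizer = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),                            # scale numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one-hot encode categories
    ])

    X = vectorizer.fit_transform(df)              # feature matrix fed to model training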
| Modeling
● Build Baseline Model
● Refine and Tune
● Try Different Architectures after baseline
● Rinse and repeat as necessary, evaluating your model against at minimum a train/test/validation split (a baseline sketch follows below)
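A minimal baseline sketch, assuming the feature matrix X and labels y produced by the vectorization step above; the 60/20/20 split and the logistic regression are just illustrative choices, not a recommendation for any particular problem.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # 60/20/20 train/validation/test split (illustrative proportions)
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, baseline.predict(X_val)))
    # Keep the test set untouched until tuning is finished.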
| About model evaluation
● Is it okay to always measure with the same accuracy metric?
○ A diagnostic model in the medical field vs automated driving vs customer demand forecasting, etc.
● It is necessary to change the evaluation method according to the problem to be solved
○ How the data is split
○ How the data is distributed
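A small illustration with made-up predictions: a screening-style classifier is judged primarily on recall and precision, while a demand forecast is judged by an error measure such as MAE. The numbers below are toy values, not results.

    from sklearn.metrics import mean_absolute_error, precision_score, recall_score

    # Screening-style classification: missing a positive case is costly, so watch recall.
    y_true_cls = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred_cls = [1, 0, 0, 1, 0, 0, 1, 0]
    print("recall:   ", recall_score(y_true_cls, y_pred_cls))     # 3 of 4 positives found
    print("precision:", precision_score(y_true_cls, y_pred_cls))  # no false positives in this toy case

    # Demand forecasting: an error measure is more informative than accuracy.
    y_true_reg = [120, 98, 143, 110]
    y_pred_reg = [115, 105, 150, 100]
    print("MAE:      ", mean_absolute_error(y_true_reg, y_pred_reg))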
| Difference between learning and production use (Inference)
● In a production environment, ML models are generally not updated
● Model parameters are updated only during training
● In a production environment, the AI model is frozen
○ Load the trained model from disk in advance and keep it in memory so it can respond to requests at any time
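A minimal sketch of that pattern, assuming a Keras model saved to a hypothetical path: the model is loaded from disk once at startup and then stays frozen in memory for serving.

    import numpy as np
    from tensorflow import keras

    model = keras.models.load_model("models/churn_model.h5")  # hypothetical path; loaded once at startup

    def predict(features):
        """Score a single request against the frozen in-memory model."""
        x = np.asarray(features, dtype="float32").reshape(1, -1)
        return model.predict(x)[0]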
| Evaluation Data
● Always understand your training data
● What kind of evaluation data do you need?
● Data actually used by the company (production data)
● How much do you need?
● The more the better
● Pay particular attention to seasonality when collecting evaluation data
| Preparation of Evaluation Data
● What is generalization performance?
● Why is generalization performance important?
● How do you measure it? (see the sketch below)
[Diagram: data split into Train / Dev / Test sets]
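One way to measure it, as a sketch that reuses the baseline model and splits from the modeling sketch above: compare performance on the training set against the dev set; a large gap points to poor generalization (overfitting).

    from sklearn.metrics import accuracy_score

    train_acc = accuracy_score(y_train, baseline.predict(X_train))
    dev_acc = accuracy_score(y_val, baseline.predict(X_val))
    print(f"train accuracy:     {train_acc:.3f}")
    print(f"dev accuracy:       {dev_acc:.3f}")
    print(f"generalization gap: {train_acc - dev_acc:.3f}")  # a large gap suggests overfitting
    # The test set is evaluated only once, at the very end.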
| Deployment
● Environment Construction and Setup
● Intended usage scenarios (Batch, Real time,..)
● Resource usage understanding/provisioning
● Post-deployment monitoring and model relearning
About deployment
[Diagram: model learning (training) produces a model that is deployed to an AI server; on disk the model consists of a definition (config) file and a weight file]
※ Weight files range from tens of megabytes for small models to close to a gigabyte for large ones.
※ For real-time processing, the model must be deployed into memory in advance so that it can process requests without making callers wait.
※ A cloud API works the same way in principle.
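As an illustration of the definition-file / weight-file split, here is a minimal sketch with the Keras API and hypothetical file names; other frameworks separate architecture and parameters in a similar way.

    from tensorflow import keras

    # Training side: persist architecture and weights as separate files.
    with open("model_definition.json", "w") as f:
        f.write(model.to_json())                  # definition (config) file
    model.save_weights("model.weights.h5")        # weight file

    # Serving side: rebuild the model and load the weights into memory.
    with open("model_definition.json") as f:
        serving_model = keras.models.model_from_json(f.read())
    serving_model.load_weights("model.weights.h5")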
| About AI execution environment
● Local machine
● Embedded (Raspberry Pi, phone, ...)
● Python script
● On-prem server as part of an application
● AI workflow platform (SKIL, Michelangelo, FBLearner, SageMaker, ...)
Model Deployment & maintenance
What does it mean to deploy a learned model?
Where does the need for model maintenance come from?
This section introduces the final step that remains after you have obtained a good model.
| Monitoring After Deployment
● Why do we need monitoring?
● Concept Drift
● Example:
○ Marketing based on customers' online shopping behavior
○ Fraud detection
○ Data with seasonality
※ As a premise, understand that data is ever-changing
Human-in-the-loop thinking
Assume from the start that an AI model will never be 100% accurate, and work on solving the problem with that premise.
Even if accuracy is not 100%, the desired effect can still be achieved.
Human in the loop is one approach that makes this possible.
| Human-in-the-loop concept
● Instead of aiming for 100% accuracy (generally not possible)
● Have a recovery plan for when models fail. Prefer feedback.
[Diagram: input → AI model → decision; if the decision is right, accept it; if wrong, send feedback back into the loop]
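A rough sketch of such a routing rule, assuming the model returns a reasonably calibrated probability; the threshold and the review/feedback helpers below are hypothetical.

    CONFIDENCE_THRESHOLD = 0.9        # hypothetical; tune per use case and risk tolerance

    def decide(features):
        # `score` is assumed to be a calibrated probability from the deployed model.
        score = float(model.predict(features.reshape(1, -1))[0])
        if score >= CONFIDENCE_THRESHOLD or score <= 1 - CONFIDENCE_THRESHOLD:
            return "auto", int(score >= 0.5)   # confident either way: accept the model's decision
        label = ask_human_reviewer(features)   # hypothetical review queue / UI
        record_feedback(features, label)       # hypothetical store; feeds later retraining
        return "human", label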
| Calibrate your Models
● Requirements to realize Human in the loop:
● Many recent Deep Learning models are not correctly calibrated.
● Model predictions need properly calibrated confidence scores
● Calibration is mandatory for deciding whether to intervene in the workflow
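As one example of calibration (scikit-learn's CalibratedClassifierCV; not necessarily what a given platform uses internally), reusing the training data from the earlier sketches:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.ensemble import GradientBoostingClassifier

    # Wrap an uncalibrated classifier so its predicted probabilities better match
    # observed frequencies; isotonic regression is fitted via internal cross-validation.
    calibrated_clf = CalibratedClassifierCV(
        GradientBoostingClassifier(), method="isotonic", cv=3
    )
    calibrated_clf.fit(X_train, y_train)

    proba = calibrated_clf.predict_proba(X_val)[:, 1]   # calibrated confidence scores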
Model maintenance
| A bit more on Concept Drift
A phenomenon in which the characteristics of the data keep changing over time, so that the data used for training and the latest data no longer match.
Examples of data with seasonality
● Customer behavior in an online shop
○ Affects online advertising and promotion decisions.
● Product design document data
○ The format of the design documents, the contents of the descriptions, etc. keep changing, which affects reading accuracy.
| Countermeasures for Concept Drift
● In the day-to-day operation, store feedback from the operators.
● Perform continuous / periodic model performance checks and relearning
with newly stored data.
● Continuously monitor a key metric, such as the frequency of feedback
● Sometimes batch retraining or tuning the model might be better
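One simple periodic check, sketched here with a two-sample Kolmogorov-Smirnov test; the feature name, the production-data loader, and the alert threshold are all hypothetical.

    import numpy as np
    from scipy.stats import ks_2samp

    train_amounts = np.asarray(df["amount"])            # distribution at training time (hypothetical column)
    recent_amounts = load_recent_production("amount")   # hypothetical: e.g. last week of live inputs

    stat, p_value = ks_2samp(train_amounts, recent_amounts)
    if p_value < 0.01:                                  # alert threshold is a judgment call
        print("Possible drift in 'amount': investigate and consider retraining.")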
Deployment Scenarios
| Different Types of Hosts
Host primitives:
● Bare metal
● Virtual machine (VM)
● Container
Management layers:
● Configuration manager / automation engine (example: Ansible): an automation platform that performs configuration management, software provisioning, and application deployment
● Hypervisor (examples: Xen, KVM): virtualization infrastructure that creates, manages, and runs one or more virtual machines on a host machine
● Orchestrator (examples: Kubernetes, Marathon): an orchestration platform that manages, provisions, and scales containerized applications
| Server Virtualization
● A server is comprised of a set of hosts.
● Each server can be a combination of bare metal servers, virtual machines, and containers.
● Each host has a cluster membership indicating which cluster it belongs to.
| Configuration
[Diagram: single server, multiple servers, and hybrid cloud configurations]
● High Availability Using ZooKeeper
● Distributed Training and Inference on Multi-GPUs via Spark
● Integration with JVM ecosystem (Hadoop)
● Many SKIL server components are also embeddable in Java applications
| Single Node
Architecture
● Simple architecture for getting started
● Cost-effective solution for customers with less than 100 GB of data
● Perfect for a DGX-1 type system
[Diagram: a single host (CPU, GPU, OS) running ZooKeeper, SKIL training workspaces, SKIL deployments, data exploration/training, and SKIL data connectors]
| Multi-Node Training Cluster (Scaled-Out Training Cluster)
Architecture
● Any midrange VM or dedicated machine for ZooKeeper
● One or more multi-GPU systems (DGX class or similar) for SKIL
● Gluster/HDFS provides a global file system for data
| HYBRID CLOUD
Hybrid Cloud Cluster (Cloud Data Storage)
Architecture
● DGX-1 servers for SKIL with 8 P100/V100 GPUs
● An existing Hadoop cluster is used by SKIL for
○ ETL (preparing data for training on GPU), or
○ batch inference for distributed scoring with trained models
[Diagram: cloud data storage such as Amazon S3 or Azure Blob alongside the SKIL cluster]
| HYBRID CLOUD
Hybrid Cloud Cluster (Cloud Server)
Architecture
● DGX-1 servers for SKIL with 8 P100/V100 GPUs
● Private/public clouds such as AWS EC2 or Azure VMs to serve models
[Diagram: models served on cloud servers such as AWS EC2 and GCP]
| MULTI-CLUSTER
GPU Training Cluster + CPU Inference Cluster
Architecture
● Powerful GPU servers or a Spark cluster for training models
● Separate (multiple) deployment-only clusters for production deployments of ML models as REST APIs
| Edge Deployment (Edge Inference Cluster)
Architecture
● Inference runs on edge devices
● (Optional) Retraining on a powerful on-premise server or cloud cluster
○ MHS tracks the performance metrics to prompt retraining
[Diagram: SKIL instances on IoT devices, backed by SKIL on an on-premise cluster and a public/private cloud]
| Edge Deployment (Edge Deployment Cluster: Training and Inference)
Architecture
● SKIL deployed on edge devices (embedded systems) for
○ training of lightweight models
○ retraining of models / transfer learning
● Alternatively, models can be trained on a central server; agents on the end devices swap models in and out of the devices
● A central SKIL server navigates and coordinates between the edge devices
[Diagram: a central SKIL server coordinating SKIL instances on embedded systems such as a robot, a tablet, and a car]
Various deployment methods

Local machine (execution on the RPA robot execution machine)
・ Low introduction cost
・ No need for additional infrastructure investment
・ The period until introduction is short
・ Flexible response to individual cases
・ Flexibly customizable
・ Individual environment construction is troublesome
・ Scalability is low
・ High maintenance cost
・ Machine specification is low

On-prem server
・ Flexible response to individual cases
・ Flexibly customizable
・ Scalability is high
・ Machine spec is high
・ Infrastructure investment is required
・ Individual environment construction is relatively easy
・ High maintenance cost

Cloud API
・ Low introduction cost
・ The period until introduction is short
・ No need for model preparation
・ Environment construction is easy
・ Resolvable issues are limited
・ Customization is difficult
・ Pay-per-use
| Other considerations
● Precision/memory trade-off for models
● Model compression for large models and constrained environments
● Hardware support (TPUs do not support all ops, certain things only run on ARM/Intel, ...)
● Model quantization is useful; such optimizations are becoming more common (see the sketch below)
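As one example of quantization, a minimal post-training quantization sketch with TensorFlow Lite, assuming `model` is an already trained Keras model; other frameworks and hardware targets have comparable tooling.

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)   # `model`: trained Keras model
    converter.optimizations = [tf.lite.Optimize.DEFAULT]          # enable default post-training quantization
    tflite_model = converter.convert()

    with open("model_quantized.tflite", "wb") as f:
        f.write(tflite_model)                                     # much smaller artifact for edge/mobile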
