End to End AI Workflows
Skymind
Skymind Overview
Schedule
● 9:30-10:00: Doors Open
● 10:00-11:20: Workflow Overview
● 11:20-11:30: Break
● 11:30-12:00: Hands-on walkthrough of training models
● 12:00-13:00: Lunch
● 13:00-13:50: Deployment Scenarios/Considerations
● 13:50-14:00: Break
● 14:00-14:50: Hands-on: serving models
● 15:00-15:50: Overview of different ways of serving models
● 16:00-16:15: Wrap-up
Workflow Overview
● Organize and define issues and problems
● Organize the expected effects
● Feasibility Study
● Experimental design
● Data collection
● Data organization / analysis / pre-processing
● Build a baseline model
● Redo preprocessing and improve the model
● Tuning
● Evaluation
● Environment construction & deployment
● Post-deployment monitoring and model retraining
※ We will provide training that covers all of these in
detail.
| Challenges of ML in Enterprise
● Different teams have different technology preferences
● Deep learning frameworks built for experimentation do not necessarily make good production frameworks
● Data scientists are great at experimentation, less so at implementing production-ready models and DevOps
● The Engineering/DevOps team rewrites models for the production environment
Data Science Team
● Write deep learning workflows and models in notebooks or another development IDE
Engineering/DevOps Team
● Refactor or rewrite models for production environments
● Automate training and optimization jobs
● Deploy models
Tools Overview
| Ecosystem
ML, DL Frameworks: Keras, TensorFlow, Deeplearning4J, Scikit-Learn, ..., Others
Model Development Workflows: Feature Extraction, Model Import, Model Training, Hyperparameter Optimization, ...
Runtime: Retraining, Model Performance Monitoring, Model Versioning, Job Scheduling, ...
Managing AI models over time
● More than a REST API
● Model calibration and input outlier detection
● Monitor inputs and adapt to changes in evolving data
Learning loop in production: AI Model, Decisions, Human-in-the-Loop, A/B Testing, Performance Monitoring, Retraining, Hot Swap
Scoping a Project
● Organize and define issues and problems
● Organize the expected effects
● Feasibility Study
● Experimental design
| Feasibility study, experimental design
● It usually takes just a few weeks.
● Is anyone already doing a similar task somewhere?
● How much can realistically be achieved?
● What data do you prepare, what techniques do you use, and how do you evaluate the result?
● What kind of team do you put together? What skills do you need?
● What kind of infrastructure do you need?
| Data
● Data Collection
● Data organization / analysis / pre-processing
● EDA
● Data Quality Assessment
| ETL/Data Collection
● Understand your Data Sources
● Do you have enough of the right kind of data?
● Can you access the data? (regulations)
● What is your vectorization pipeline?
● What data volumes do you expect to accumulate over time? Per day? Per month?
| Modeling
● Build Baseline Model
● Refine and Tune
● Try Different Architectures after baseline
● Rinse and repeat as necessary, evaluating your model on at minimum a train/validation/test split
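A minimal sketch of the baseline-then-iterate loop described above, assuming toy data and a majority-class baseline. The names `three_way_split`, `majority_baseline` and the split fractions are illustrative choices, not part of the original deck.

```python
import random
from collections import Counter


def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


def majority_baseline(train):
    """Baseline 'model': always predict the most common training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common_label


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


# Hypothetical toy data: (feature, label) pairs where 1 in 4 labels is positive.
data = [(i, int(i % 4 == 0)) for i in range(200)]
train, val, test = three_way_split(data)
baseline = majority_baseline(train)

# Tune against the validation split; touch the test split only for the final check.
print("validation accuracy of baseline:", accuracy(baseline, val))
```

Any later architecture only counts as an improvement if it beats this baseline on the validation split.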
| About model evaluation
● Is it okay to always measure with the same accuracy metric?
○ Diagnostic model in the medical field vs. automated driving vs. customer demand forecasting, etc.
● The evaluation method must change according to the problem to be solved
○ About the data split
○ About the distribution of the data
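To illustrate why a single accuracy number can mislead when the data distribution is skewed, here is a small sketch on made-up labels: a model that never predicts the rare class still scores high accuracy while its recall is zero. The data and function names are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Accuracy, precision and recall; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical screening data: 10 positive cases out of 1000.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "model" that never flags anything

m = metrics(y_true, always_negative)
print(m)
```

For a medical diagnostic model, the missed positives (recall) matter far more than the headline accuracy, which is the point of choosing the metric per problem.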
| Difference between learning and production use (inference)
● Models are updated during training
● In a production environment, models are generally frozen and do not update
○ Load the trained model from disk in advance and keep it in memory so that it can respond to requests at any time
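The load-once, serve-many pattern above can be sketched as follows. This is a toy linear model stored as JSON, not a real framework format; `FrozenLinearModel`, `handle_request` and the file layout are hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class FrozenLinearModel:
    """Inference-only model: parameters are read once and never updated."""

    def __init__(self, weights, bias):
        self._weights = tuple(weights)  # tuple: no in-place updates while serving
        self._bias = bias

    @classmethod
    def load(cls, path):
        with open(path) as f:
            params = json.load(f)
        return cls(params["weights"], params["bias"])

    def predict(self, features):
        return sum(w * x for w, x in zip(self._weights, features)) + self._bias


# Training side (stand-in): write the learned parameters to disk.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(model_path, "w") as f:
    json.dump({"weights": [0.5, -1.0], "bias": 2.0}, f)

# Serving side: load once at startup, then answer every request from memory.
MODEL = FrozenLinearModel.load(model_path)


def handle_request(features):
    """Per-request work is a pure in-memory computation; no disk access, no training."""
    return {"prediction": MODEL.predict(features)}


print(handle_request([2.0, 1.0]))
```

Swapping in a newer model then means loading a new file and replacing `MODEL`, rather than updating weights in place.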
| Evaluation Data
● Always understand your training data
● What kind of evaluation data do you need?
○ Data actually used by the company (production data)
● How much do you need?
○ The more the better
● Pay particular attention to seasonality when collecting evaluation data
| Preparation of Evaluation Data
● What is the generalization performance?
● Why is generalization performance important?
● How do you measure it?
Train | Dev | Test
| Deployment
● Environment Construction and Setup
● Intended usage scenarios (Batch, Real time,..)
● Resource usage understanding/provisioning
● Post-deployment monitoring and model retraining
About deployment
Deploy ML
● Model learning (training) writes the model to disk as a definition file and a weight file
● The AI server loads the model and its config from disk
※ Weight files are small: typically dozens of megabytes, and large ones are close to a gigabyte.
※ To perform real-time processing, the model must be loaded into memory in advance and wait for requests.
※ A cloud API works the same way in principle.
| About AI execution environment
● Local machine
● Embedded (Raspberry Pi, phone, ...)
● Python script
● On-prem server as part of an application
● AI workflow platform (SKIL, Michelangelo, FB Learner, SageMaker, ...)
Model Deployment & Maintenance
What does it mean to deploy a trained model?
Where does the need for model maintenance come from?
This part introduces the last step that remains after you have a good model.
| Monitoring After Deployment
● Why do we need monitoring?
● Concept Drift
● Example:
○ Marketing based on customers' online shopping behavior
○ Fraud detection
○ Data with Seasonality
※ As a premise, understand that data is ever changing
Human-in-the-loop thinking
Approach problem solving on the assumption that an AI model will never be 100% accurate.
Value can still be delivered even if accuracy is not 100%.
One way to do this is the idea of human in the loop.
| Human-in-the-loop concept
● Do not aim for 100% accuracy (generally not possible)
● Instead, have a recovery plan for when models fail. Prefer feedback.
Input → AI model → Decision
Right? Good.
Wrong? Feedback.
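The decision flow above could be sketched roughly like this: confident predictions are handled automatically, the rest go to a person, and every correction is stored as feedback. The class name, the 0.9 threshold and the item ids are assumptions made for the example, not something the deck specifies.

```python
from dataclasses import dataclass, field


@dataclass
class HumanInTheLoopRouter:
    threshold: float = 0.9  # hypothetical cut-off; tune per use case
    feedback_log: list = field(default_factory=list)

    def route(self, item_id, label, confidence):
        """Accept confident predictions, send the rest to a human reviewer."""
        if confidence >= self.threshold:
            return {"id": item_id, "label": label, "handled_by": "model"}
        return {"id": item_id, "suggested_label": label, "handled_by": "human"}

    def record_feedback(self, item_id, predicted, corrected):
        """Store the human decision so it can later be used for retraining."""
        entry = {
            "id": item_id,
            "predicted": predicted,
            "corrected": corrected,
            "was_wrong": predicted != corrected,
        }
        self.feedback_log.append(entry)
        return entry


router = HumanInTheLoopRouter(threshold=0.9)
auto = router.route("invoice-1", "approved", confidence=0.97)
manual = router.route("invoice-2", "approved", confidence=0.55)
router.record_feedback("invoice-2", predicted="approved", corrected="rejected")
print(auto, manual, router.feedback_log, sep="\n")
```

The threshold only makes sense if the confidence scores are trustworthy, which is why calibration matters for this workflow.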
| Calibrate your Models
● Requirements to realize human in the loop:
○ Proper confidence for model predictions
○ Confidence is mandatory to determine whether to intervene in the workflow
● Many recent deep learning models are not correctly calibrated.
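One common way to check whether confidence scores can be trusted is the expected calibration error (ECE): bucket predictions by confidence and compare the average confidence in each bucket with the observed accuracy. The sketch below uses equal-width bins and made-up numbers; the deck does not say which calibration measure it has in mind, so treat ECE as one possible choice.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - acc)
    return ece


# Hypothetical overconfident model: claims 95% but is right 6 times out of 10.
overconfident = expected_calibration_error([0.95] * 10, [1] * 6 + [0] * 4)

# Hypothetical well-calibrated model: claims 75% and is right 3 times out of 4.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])

print(overconfident, calibrated)
```

A large gap suggests the raw scores should be recalibrated before they are used to decide when a human intervenes.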
Model maintenance
| A bit more on Concept Drift
A phenomenon in which the characteristics of data continue to change with
time, and the characteristics of the data used for learning and the latest data
differ.
Example of data with Seasonality
● Customer behavior in an online shop
○ Affects online advertising, promotion decisions.
● Product design document data
○ The format of the design document, the contents of the description, etc.
keep changing, which affects the reading accuracy.
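A rough way to detect that recent data no longer looks like the training data is to compare the two distributions feature by feature. The sketch below uses the population stability index (PSI) over equal-width bins; PSI is swapped in here as one familiar drift measure, and the bin count, epsilon and sample data are assumptions.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample (training data) and a current sample.

    Bins are equal-width over the reference range; values outside that range
    are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(idx, n_bins - 1))] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Hypothetical feature values: training period vs. a later, shifted period.
training_values = [i / 100 for i in range(100)]
same_period = list(training_values)
shifted_period = [0.5 + i / 200 for i in range(100)]

print(population_stability_index(training_values, same_period))
print(population_stability_index(training_values, shifted_period))
```

Identical samples give a PSI of zero, and the value grows as the current data moves away from what the model was trained on.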
| Countermeasures for Concept Drift
● In day-to-day operation, store feedback from the operators.
● Perform continuous / periodic model performance checks and retraining with newly stored data.
● Continuously monitor a key metric for the frequency of feedback
● Sometimes batch retraining or tuning the model might be better
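The countermeasures above can be sketched as a rolling monitor over operator feedback: each prediction is recorded as corrected or not, and retraining is flagged once the correction rate in the latest window crosses a limit. The window size and the 20% limit are placeholder values, and `FeedbackRateMonitor` is a hypothetical name.

```python
from collections import deque


class FeedbackRateMonitor:
    """Tracks how often operators correct the model over a sliding window."""

    def __init__(self, window=100, max_correction_rate=0.2):
        self._events = deque(maxlen=window)  # oldest events drop off automatically
        self.max_correction_rate = max_correction_rate

    def record(self, was_corrected):
        self._events.append(bool(was_corrected))

    @property
    def correction_rate(self):
        return sum(self._events) / len(self._events) if self._events else 0.0

    def needs_retraining(self):
        """Only raise the flag once a full window of evidence is available."""
        window_full = len(self._events) == self._events.maxlen
        return window_full and self.correction_rate > self.max_correction_rate


monitor = FeedbackRateMonitor(window=10, max_correction_rate=0.2)
for _ in range(10):
    monitor.record(False)           # model output accepted as-is
quiet = monitor.needs_retraining()

for _ in range(3):
    monitor.record(True)            # operators start correcting the model
drifting = monitor.needs_retraining()

print(quiet, drifting, monitor.correction_rate)
```

The flag could feed a scheduled batch retraining job instead of triggering retraining immediately.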
Deployment Scenarios
Host primitives: Bare Metal, Virtual Machine (VM), Container
Configuration managers:
● Hypervisor (example: Xen, KVM): virtualization infrastructure that creates, manages and runs one or more virtual machines on a host machine
● Automation engine (example: Ansible): automation platform that performs configuration management, software provisioning and application deployment
● Orchestrator (example: Kubernetes, Marathon): orchestration platform that manages, provisions and scales containerized applications
| Different Types of Hosts
● A server comprises a set of hosts.
● Each server can be a combination of bare metal servers, virtual machines and containers.
● Each host has a cluster membership indicating which cluster it belongs to.
| Server Virtualization
Single Server, Multiple Servers, Hybrid Cloud
| Configuration
● High Availability Using ZooKeeper
● Distributed Training and Inference on Multi-GPUs via Spark
● Integration with JVM ecosystem (Hadoop)
● Many SKIL server components are also embeddable in Java applications
| Single Node
Architecture
● Simple architecture for getting started
● Cost-effective solution for customers with less than 100 GB of data
● Perfect for a DGX-1 type system
Components: CPU, OS, GPU, ZooKeeper, SKIL Training Workspaces, SKIL Deployments, Data Exploration / Training, SKIL Data Connectors
| Multi-Node Training Cluster
Scaled-Out Training Cluster
Architecture
● Any midrange VM or dedicated machine for ZooKeeper
● 1 or more multi-GPU systems (DGX class or similar) for SKIL
● Gluster/HDFS provides a global file system for data
| HYBRID CLOUD
Hybrid Cloud Cluster (Cloud Data Storage)
Architecture
● DGX-1 servers for SKIL with 8 P100/V100 GPUs
● Existing Hadoop cluster is used by SKIL for
○ ETL (preparing data for training on GPU) or
○ Batch inference for distributed scoring with trained models
Cloud data storage: Amazon S3, Azure Blob
| HYBRID CLOUD
Hybrid Cloud Cluster (Cloud Server)
Architecture
● DGX-1 servers for SKIL with 8 P100/V100 GPUs
● Private / public clouds such as AWS EC2 or Azure VM to serve models
Cloud servers: AWS EC2, GCP
| MULTI-CLUSTER
GPU Training Cluster, CPU Inference Cluster
Architecture
● Powerful GPU servers or a Spark cluster for training models
● Separate (multiple) deployment-only clusters for production deployments of ML models as REST APIs
Edge Inference Cluster
| Edge Deployment
Edge Deployment Cluster (Training, Inference)
Architecture
● Inference on edge devices
● (Optional) Retraining on a powerful on-premise server/cloud cluster
○ MHS tracks the performance metrics to prompt retraining
Components: IoT devices, SKIL, on-premise cluster, public / private cloud
| Edge Deployment
Embedded System
Architecture
● SKIL deployed on edge devices for
○ training of lightweight models
○ retraining of models / transfer learning
● Alternatively, models can be trained on a central server. Agents on the end devices swap models in and out of the devices
● A central SKIL server navigates and coordinates between the edge devices
Components: central SKIL server; SKIL on each edge device (robot, tablet, car)
Various deployment methods
Local machine (execution on the RPA robot execution machine)
・ Low introduction cost
・ No need for additional infrastructure investment
・ Short lead time until introduction
・ Flexible response to individual cases
・ Flexibly customizable
・ Setting up each individual environment is troublesome
・ Scalability is low
・ High maintenance cost
・ Machine specification is low
On-prem server
・ Flexible response to individual cases
・ Flexibly customizable
・ Scalability is high
・ Machine specification is high
・ Requires infrastructure investment
・ Setting up individual environments is relatively easy
・ High maintenance cost
Cloud API
・ Low introduction cost
・ Short lead time until introduction
・ No need for model preparation
・ Environment setup is easy
・ Solvable problems are limited
・ Customization is difficult
・ Pay-per-use
| Other considerations
● Precision/memory trade-off for models
● Model compression for large models / constrained environments
● Hardware support (TPUs do not support all ops, certain things only run on ARM/Intel, ...)
● Model quantization is useful; such optimization is becoming more common
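As an illustration of the precision/memory trade-off, here is a minimal sketch of symmetric 8-bit quantization: each weight is stored as an integer in [-127, 127] plus one shared scale, which takes a quarter of the space of 32-bit floats at the cost of a small rounding error. Real toolchains use more elaborate schemes; the function names and sample weights are assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights) or 1.0  # guard against all-zero weights
    scale = max_abs / 127
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical layer weights.
weights = [0.5, -1.27, 0.0, 1.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Whether that error is acceptable depends on the model and has to be checked against the evaluation data after quantizing.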

Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 

End to End ML Workflows

  • 1. End to End AI Workflows Skymind
  • 3. Schedule ● 9:30-10:00: Doors Open ● 10:00-11:20: Workflow Overview ● 11:20-11:30: Break ● 11:30-12:00: Hands-on walkthrough of training models ● 12:00-13:00: Lunch ● 13:00-13:50: Deployment Scenarios/Considerations ● 13:50-14:00: Break ● 14:00-14:50: Serving models hands-on ● 15:00-15:50: Overview of different ways of serving models ● 16:00-16:15: Wrap
  • 4. Workflow Overview ● Organize and define issues and problems ● Organize effects ● Feasibility Study ● Experimental design ● Data collection ● Data organization / analysis / pre-processing ● Build a baseline model ● Redo preprocessing again, improve the model ● Tuning ● Evaluation ● Environment construction & deployment ● Post-deployment monitoring and model relearning ※ We will provide training that covers all of these in detail.
  • 5. | Challenges of ML in Enterprise ● Different teams have different technology preferences ● Deep learning frameworks used for experimentation do not necessarily make good production frameworks ● Data scientists are great at experimentation, but less so at implementing production-ready models and the DevOps work around them ● The Engineering/DevOps team rewrites models for the production environment Data Science Team: ● Write deep learning workflows and models in notebooks or another development IDE Engineering/DevOps Team: ● Refactor or rewrite models for production environments ● Automate training and optimization jobs ● Deploy models
  • 6. Tools Overview | Ecosystem ● ML, DL Frameworks: Keras, TensorFlow, Deeplearning4J, Scikit-Learn, ... Others ● Model Development Workflows: Feature Extraction, Model Import, Model Training, Hyperparameter Optimization, ... ● Runtime: Retraining, Model Performance Monitoring, Model Versioning, Job Scheduling, ...
  • 7. Managing AI models over time ● More than a REST API ● Model calibration and input outlier detection ● Monitor inputs and adapt to changes in evolving data Learning loop in production: AI Model, Decisions, Human-In-The-Loop, A/B Testing, Performance Monitoring, Retraining, Hot Swap
  • 8. Scoping a Project ● Organize and define issues and problems ● Organize effects ● Feasibility Study ● Experimental design
  • 9. | Feasibility study, experimental design ● It is quick: it usually takes just a few weeks. ● Is someone already doing a similar task somewhere? ● How feasible is it? ● What data do you prepare, what techniques do you use, and how do you evaluate the result? ● What kind of team structure do you need? What skills do you need? ● What kind of infrastructure do you need?
  • 10. | Data ● Data Collection ● Data organization / analysis / pre-processing ● EDA ● Data Quality Assessment
  • 11. | ETL/Data Collection ● Understand your data sources ● Do you have enough of the right kind of data? ● Can you access the data? (regulations) ● What is your vectorization pipeline? ● What data volumes do we expect to accumulate over time? Per day? Per month?
  • 12. | Modeling ● Build a baseline model ● Refine and tune ● Try different architectures after the baseline ● Rinse and repeat as necessary, evaluating your model on at minimum a train/validation/test split
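The split mentioned on the Modeling slide can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the 70/15/15 proportions and the fixed seed are assumptions made for the example, not values from the slides.

```python
import random


def split_dataset(samples, train_frac=0.7, dev_frac=0.15, seed=0):
    """Shuffle once, then cut into train / dev (validation) / test partitions.

    The fractions and the seed are illustrative defaults, not recommendations.
    """
    if not (0 < train_frac < 1 and 0 <= dev_frac < 1 and train_frac + dev_frac < 1):
        raise ValueError("fractions must leave room for a test partition")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between experiments.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_dataset(range(100))
```

A purely random split like this ignores time order. For data with seasonality, a split by date range is usually the safer choice.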
  • 13. | About model evaluation ● Is it okay to always measure with the same accuracy metric? ○ Diagnostic model in the medical field vs automated driving vs customer demand forecast ... etc ● It is necessary to change the evaluation method according to the problem to be solved ○ About data split ○ About the distribution of data
  • 14. | Difference between learning and production use (inference) ● Models are updated during training ● Basically, in a production environment, ML models are not updated ● In a production environment, AI models are frozen ○ Load a learned model from disk in advance and keep it in memory so that it can respond to requests at any time
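The "load in advance, keep in memory, never update" pattern from this slide might look like the sketch below. The JSON file layout, the `FrozenModel` class and the `handle_request` function are hypothetical names invented for the illustration; a real deployment would use the serialization format of its own framework.

```python
import json
import os
import tempfile


class FrozenModel:
    """Read-only linear model: weights are loaded once and never updated."""

    def __init__(self, path):
        with open(path) as f:
            params = json.load(f)
        # A tuple makes accidental in-place updates during serving impossible.
        self._weights = tuple(params["weights"])
        self._bias = params["bias"]

    def predict(self, features):
        if len(features) != len(self._weights):
            raise ValueError("unexpected number of features")
        return sum(w * x for w, x in zip(self._weights, features)) + self._bias


# Training side (assumed): writes the learned parameters to disk.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(model_path, "w") as f:
    json.dump({"weights": [0.5, -1.0], "bias": 2.0}, f)

# Serving side: load once at startup, before any request arrives.
MODEL = FrozenModel(model_path)


def handle_request(features):
    # No disk access and no weight update on the request path.
    return MODEL.predict(features)
```

Loading at startup moves the cost of reading the weight file out of the request path, which matters most when the file is large.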
  • 15. | Evaluation Data ● Always understand your training data ● What kind of evaluation data do you need? ● Data actually used by the company (production data) ● How much do you need? ● The more the better ● Pay particular attention to seasonality when collecting evaluation data
  • 16. | Preparation of Evaluation Data ● What is the generalization performance? ● Why is generalization performance important? ● How do you measure it? Train / Dev / Test
  • 17. | Deployment ● Environment Construction and Setup ● Intended usage scenarios (Batch, Real time,..) ● Resource usage understanding/provisioning ● Post-deployment monitoring and model relearning
  • 19. Deploy ML ● Model learning (training) produces the model config on disk: a definition file and a weight file ● The AI server loads the model from disk ※ Weight files range from dozens of megabytes for small models to close to a gigabyte for large ones. ※ To perform real-time processing, the model must be loaded into memory in advance and wait for requests. ※ In principle, a cloud API works the same way!
  • 20. | About AI execution environment ● Local machine ● Embedded (Raspberry Pi, phone, ..) ● Python script ● On-prem server as part of an application ● AI Workflow Platform (SKIL, Michelangelo, FB Learner, SageMaker, ..)
  • 21. Model Deployment & Maintenance ● What does it mean to deploy a learned model? ● Where does the need for model maintenance come from? ● This section introduces the last step that remains after getting a good model.
  • 22. | Monitoring After Deployment ● Why do we need monitoring? ● Concept Drift ● Example: ○ Marketing from the customer's online shopping behavior ○ Fraud detection ○ Data with Seasonality ※ As a premise, understand that data is ever changing
  • 23. Human-in-the-loop thinking ● Work on problem solving on the assumption that an AI model will never be 100% accurate. ● Even if the accuracy is not 100%, the benefit can still be achieved. ● One such method is the idea of human in the loop.
  • 24. | Human-in-the-loop concept ● Instead of aiming for 100% accuracy (generally not possible) ● Have a recovery plan for when models fail. Prefer feedback. input AI model Decision Right? Good. Wrong? Feedback.
  • 25. | Calibrate your Models ● Requirement for realizing human in the loop: proper confidence for model predictions ● Many recent deep learning models are not correctly calibrated ● Calibrated confidence is mandatory for deciding whether to intervene in the workflow
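One way to make the calibration requirement concrete is sketched below. The slides do not name a metric. Expected calibration error (ECE) is swapped in here as a common choice, and the bin count, the 0.8 threshold and the function names are assumptions made for the example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


def route(confidence, threshold=0.8):
    """Send low-confidence predictions to a human reviewer (assumed policy)."""
    return "auto" if confidence >= threshold else "human_review"


# The model claims 90% confidence but is right only 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A threshold rule such as `route` is only meaningful when the ECE is small. With a poorly calibrated model, a confidence of 0.9 says little about how often the prediction is actually right.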
  • 27. | A bit more on Concept Drift A phenomenon in which the characteristics of data continue to change with time, and the characteristics of the data used for learning and the latest data differ. Example of data with Seasonality ● Customer Movement in Online Shop ○ Affects online advertising, promotion decisions. ● Product design document data ○ The format of the design document, the contents of the description, etc. keep changing, which affects the reading accuracy.
  • 28. | Countermeasures for Concept Drift ● In the day-to-day operation, store feedback from the operators. ● Perform continuous / periodic model performance checks and relearning with newly stored data. ● Continuously monitor a key metric for frequency of feedback ● Sometimes batch retraining or tuning the model might be better
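The idea of monitoring a key metric for the frequency of feedback can be sketched as a sliding-window counter. The class name `FeedbackMonitor`, the window size and the 20% threshold are hypothetical choices for the illustration, not values from the slides.

```python
from collections import deque


class FeedbackMonitor:
    """Tracks how often operators correct the model over a sliding window."""

    def __init__(self, window=100, max_correction_rate=0.2):
        # deque(maxlen=...) silently drops the oldest event when full.
        self._events = deque(maxlen=window)
        self._max_rate = max_correction_rate

    def record(self, was_corrected):
        self._events.append(bool(was_corrected))

    def correction_rate(self):
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def needs_retraining(self):
        # Wait for a full window so a few early corrections do not trigger.
        window_full = len(self._events) == self._events.maxlen
        return window_full and self.correction_rate() > self._max_rate


monitor = FeedbackMonitor(window=10, max_correction_rate=0.2)
for _ in range(10):
    monitor.record(False)      # operators accept every prediction
stable = monitor.needs_retraining()
for _ in range(5):
    monitor.record(True)       # corrections start to arrive
drifting = monitor.needs_retraining()
```

The stored corrections double as new labeled data, so the same feedback stream can drive both the alert and the periodic relearning step.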
  • 30. | Different Types of Hosts ● Virtual Machine (VM): Hypervisor (Example: Xen, KVM). Virtualization infrastructure creates, manages and runs one or more virtual machines on a host machine ● Bare Metal: Automation Engine (Example: Ansible). An automation platform performs configuration management, software provisioning and application deployment ● Container: Orchestrator (Example: Kubernetes, Marathon). An orchestration platform manages, provisions and scales containerized applications
  • 31. | Server Virtualization ● A server comprises a set of hosts. ● Each server can be a combination of bare metal servers, virtual machines and containers. ● Each host has a cluster membership indicating which cluster it belongs to.
  • 32. | Configuration: Single Server, Multiple Servers, Hybrid Cloud ● High availability using ZooKeeper ● Distributed training and inference on multi-GPUs via Spark ● Integration with the JVM ecosystem (Hadoop) ● Many SKIL server components are also embeddable in Java applications
  • 33. | Single Node Architecture ● Simple architecture for getting started ● Cost-effective solution for customers with less than 100GB of data ● Perfect for a DGX-1 type system Diagram: CPU, OS, GPU, ZooKeeper, SKIL Training Workspaces, SKIL Deployments, Data Exploration / Training, SKIL Data Connectors
  • 34. | Multi-Node Training Cluster: Scaled Out Training Cluster Architecture ● Any midrange VM or dedicated machine for ZooKeeper ● 1 or more multi-GPU systems (DGX class or similar) for SKIL ● Gluster/HDFS provides a global file system for data
  • 35. | HYBRID CLOUD: Hybrid Cloud Cluster (Cloud Data Storage: Amazon S3, Azure Blob) Architecture ● DGX-1 Servers for SKIL with 8 P100/V100 GPUs ● Existing Hadoop cluster is used by SKIL for ○ ETL (preparing data for training on GPU) or ○ Batch inference for distributed scoring with trained models.
  • 36. Architecture ● DGX-1 Servers for SKIL with 8 P100/V100 GPUs ● Private / public clouds such as AWS EC2, Azure VM to serve models | HYBRID CLOUD Hybrid Cloud Cluster (Cloud Server) AWS EC2 GCP
  • 37. GPU Training Cluster Architecture ● Powerful GPU Servers or Spark Cluster for training models ● Separate (multiple) deployments-only clusters for production deployments of ML models as REST APIs CPU Inference Cluster | MULTI-CLUSTER
  • 38. | Edge Deployment: Edge Inference Cluster Architecture ● Inference on edge devices ● (Optional) Retraining on a powerful on-premise server/cloud cluster ○ MHS tracks the performance metrics to prompt retraining Diagram: IoT devices, SKIL, public / private cloud, on-premise cluster
  • 39. | Edge Deployment: Edge Deployment Cluster (Training, Inference), Embedded System Architecture ● SKIL deployed on edge devices for ○ training of lightweight models ○ retraining of models / transfer learning ● Alternatively, models can be trained on a central server. Agents on end devices swap models in and out of the devices ● A central SKIL server navigates and coordinates between the edge devices Diagram: Central SKIL server; SKIL on Robot, Tablet and Car
  • 40. Various deployment methods: Local machine (Execution on RPA robot execution machine), On prem server, Cloud API ・ Low introduction cost ・ The period until introduction is short ・ No need for model preparation ・ Environment construction is easy ・ Resolvable issues are limited ・ Customization is difficult ・ Pay-per-use ・ Low introduction cost ・ No need for additional infrastructure investment ・ The period until introduction is short ・ Flexible response to individual cases ・ Customizable flexibly ・ It is troublesome for individual environment construction ・ Scalability is low ・ High maintenance cost ・ Machine specification is low ・ Flexible response to individual cases ・ Customizable flexibly ・ Scalability is high ・ Machine spec high ・ There is infrastructure investment ・ Individual environment construction is relatively easy ・ High maintenance cost
  • 41. | Other considerations ● Precision/memory trade-off for models ● Model compression for large models/constrained environments ● Hardware support (TPU does not support all ops, certain things only run on ARM/Intel, ..) ● Model quantization is useful. Optimization is becoming more common
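The precision/memory trade-off behind quantization can be illustrated with a small sketch. It uses symmetric 8-bit linear quantization, one common scheme chosen for the example and not necessarily what any tool from the talk applies. The function names and sample weights are hypothetical.

```python
from array import array


def quantize_int8(weights):
    """Map float weights to signed 8-bit integers with one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] so rounding can never overflow the int8 range.
    quantized = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit representation."""
    return [q * scale for q in quantized]


weights = array("f", [0.5, -1.27, 0.01, 1.0])   # 32-bit floats
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage per weight drops from weights.itemsize bytes to q.itemsize bytes,
# at the cost of a rounding error of at most scale / 2 per weight.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
```

The memory saving is fixed by the bit width, while the accuracy cost depends on the weight distribution, so the effect on model quality has to be measured on evaluation data.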