https://www.flickr.com/photos/mwichary/3209181446
Virtual Machines
CPU GPU FPGA
Databricks Batch AI AKS
Azure Machine Learning SDK
Your Applications and Pipelines
https://www.flickr.com/photos/mwichary/4145891520
https://github.com/Azure/pixel_level_land_classification
Purpose | Graphics | Compute | Compute | Compute | Deep Learning
VM Family | NV v1 | NC v1 | NC v2 | NC v3 | ND v1
GPU | NVIDIA M60 | NVIDIA K80 | NVIDIA P100 | NVIDIA V100 | NVIDIA P40
GPU Memory | 8 GB | 8 GB | 16 GB | 16 GB | 24 GB
Sizes | 1, 2 or 4 GPU | 1, 2 or 4 GPU | 1, 2 or 4 GPU | 1, 2 or 4 GPU | 1, 2 or 4 GPU
Interconnect | PCIe (dual root) | PCIe (dual root) | PCIe (dual root) | PCIe (dual root) | PCIe (dual root)
2nd Network | - | FDR InfiniBand | FDR InfiniBand | FDR InfiniBand | FDR InfiniBand
VM CPU | Haswell | Haswell | Broadwell | Broadwell | Broadwell
VM RAM | 56-224 GB | 56-224 GB | 112-448 GB | 112-448 GB | 112-448 GB
Local SSD | ~380-1500 GB | ~380-1500 GB | ~700-3000 GB | ~700-3000 GB | ~700-3000 GB
Storage | Std Storage | Std Storage | Prem Storage | Prem Storage | Prem Storage
Driver | Quadro/Grid PC | Tesla | Tesla | Tesla | Tesla
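The comparison above can be encoded as data to shortlist a VM family for a workload. This is a minimal sketch: `GpuFamily` and `shortlist` are hypothetical names introduced for illustration, and the values are transcribed from the slide rather than queried from Azure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GpuFamily:
    name: str
    purpose: str
    gpu: str
    gpu_memory_gb: int             # per GPU, as listed on the slide
    second_network: Optional[str]  # None where the slide lists nothing


# Values transcribed from the slide; not fetched from any Azure API.
FAMILIES = [
    GpuFamily("NV v1", "Graphics", "NVIDIA M60", 8, None),
    GpuFamily("NC v1", "Compute", "NVIDIA K80", 8, "FDR InfiniBand"),
    GpuFamily("NC v2", "Compute", "NVIDIA P100", 16, "FDR InfiniBand"),
    GpuFamily("NC v3", "Compute", "NVIDIA V100", 16, "FDR InfiniBand"),
    GpuFamily("ND v1", "Deep Learning", "NVIDIA P40", 24, "FDR InfiniBand"),
]


def shortlist(min_gpu_memory_gb: int, need_infiniband: bool = False) -> List[str]:
    """Return family names that meet the memory and second-network requirements."""
    return [
        f.name
        for f in FAMILIES
        if f.gpu_memory_gb >= min_gpu_memory_gb
        and (f.second_network is not None or not need_infiniband)
    ]


print(shortlist(16, need_infiniband=True))
```

Filtering on per-GPU memory and the second network reflects the two rows that differ most between families on this slide; other rows (CPU, RAM, storage) could be added to the record in the same way.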
Purpose | Graphics | Deep Learning
VM Family | NV v2 | ND v2
GPU | NVIDIA M60 | NVIDIA V100
GPU Memory | 8 GB | 16 GB
Sizes | 1, 2 or 4 GPU | 8 GPU
Interconnect | PCIe (dual root) | NVLink
2nd Network | - | -
VM CPU | Broadwell | Skylake
VM RAM | 112-448 GB | 768 GB
Local SSD | ~700-3000 GB | ~1300 GB
Storage | Prem Storage | Prem Storage
Driver | Quadro/Grid PC | Tesla
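As a rough worked example of what the size difference means, the sketch below multiplies GPU count by per-GPU memory for the largest size in each family. The dictionary layout and the function name are illustrative assumptions; the numbers come from the slide. The total is a sum across separate devices, not a single addressable pool.

```python
# Values transcribed from the slide: GPUs per VM size and memory per GPU (GB).
SIZES = {
    "NV v2": {"gpu_counts": [1, 2, 4], "gpu_memory_gb": 8},
    "ND v2": {"gpu_counts": [8], "gpu_memory_gb": 16},
}


def max_total_gpu_memory_gb(family: str) -> int:
    """Summed GPU memory of the largest VM size in a family."""
    spec = SIZES[family]
    return max(spec["gpu_counts"]) * spec["gpu_memory_gb"]


for name in SIZES:
    print(name, max_total_gpu_memory_gb(name), "GB")
```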
https://www.flickr.com/photos/mwichary/4358127163
Clusters
• Provision GPUs
• Install drivers and software
• Interactive use
Scheduling
• Queue work
• Prioritize jobs
• Start MPI
• Monitor
• Handle failures
Data
• Scale access to training data
• Output logs & models
• Secure & compliant
Cost
• Scale up and down
• Share reserved instances
• Use low priority
Workflow
• Efficient hardware
• Tooling integration
• Laptop to cloud
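The scheduling items above (queue work, prioritize jobs, handle failures) can be illustrated with a toy in-process queue. This is a sketch of the general idea only, not the Batch AI API; `run_queue`, the job names, and the retry limit are all hypothetical.

```python
import heapq
from typing import Callable, List, Tuple


def run_queue(jobs: List[Tuple[int, str, Callable[[], None]]],
              max_retries: int = 2) -> List[str]:
    """Run jobs in priority order (lower number first), requeueing failures.

    Returns a log of events so the outcome can be inspected.
    """
    # Heap entries: (priority, sequence, name, fn, attempts_so_far).
    # The unique sequence number breaks ties, so functions are never compared.
    heap = [(prio, seq, name, fn, 0) for seq, (prio, name, fn) in enumerate(jobs)]
    heapq.heapify(heap)
    seq = len(heap)
    log: List[str] = []
    while heap:
        prio, _, name, fn, attempts = heapq.heappop(heap)
        try:
            fn()
            log.append(f"{name}: succeeded")
        except Exception as exc:
            if attempts < max_retries:
                log.append(f"{name}: failed ({exc}), requeued")
                heapq.heappush(heap, (prio, seq, name, fn, attempts + 1))
                seq += 1
            else:
                log.append(f"{name}: gave up after {attempts + 1} attempts")
    return log


# Example: a job that fails once (as if its node were preempted), then succeeds.
calls = {"n": 0}


def flaky() -> None:
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("node preempted")


log = run_queue([(1, "train-b", lambda: None), (0, "train-a", flaky)])
print(log)
```

Requeueing at the same priority is one possible policy; it matters most when low-priority VMs are used, since a preempted job then resumes ahead of less urgent work.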
New API coming soon
https://github.com/Azure/BatchAI/tree/master/recipes
• Azure Blob FUSE: default in samples going forward
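Because blobfuse exposes a blob container through the Linux file system, training code can read data with ordinary file APIs. The sketch below assumes the container is already mounted; a temporary directory stands in for the mount point, and `list_training_files` is a hypothetical helper, not part of any Azure SDK.

```python
import tempfile
from pathlib import Path
from typing import List


def list_training_files(mount_dir: str, suffix: str = ".csv") -> List[Path]:
    """List training files under a mounted directory using plain file APIs."""
    return sorted(p for p in Path(mount_dir).rglob(f"*{suffix}") if p.is_file())


# A temporary directory stands in for the mount point so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as mount_dir:
    (Path(mount_dir) / "train").mkdir()
    (Path(mount_dir) / "train" / "part-0.csv").write_text("x,y\n1,2\n")
    names = [f.name for f in list_training_files(mount_dir)]
    print(names)
```

The training script needs no storage-specific code under this approach; only the path it is given changes between a laptop and the cluster.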
The AI solution challenge for VIP
Highly Adaptive
Meet the needs of different geo-locations, whose data sources, product SKUs, tag systems and cultural styles differ and lead to significant differences in modeling and prediction.
Rapid Iteration
All data acquisition, manipulation, feature engineering and modeling need to be iterated very quickly.
Easy to Replicate
Support new geo-locations with small local teams.
Global Foundation
Needs to be built on a reliable global platform that can align with and support VIP.com’s global expansion.
The largest flash-sale website in China: 20 billion USD GMV, 300 million active users per month, the first shopping choice for women and families.
The core strategy is globalization, which requires a platform to reach customers worldwide.
We use AI as the core of our products to learn the different scenarios for people in each country, such as reading habits, preferences and shopping decisions. Recommendation algorithms help match demand or extend cognition, making shopping decisions easier and more convenient.
America / Europe / South Korea / South East Asia / Russia
To achieve this, we need to establish a broad labeling system and flexible configuration of model algorithms in seconds, and to train on tens of millions of items, all with a small team using rapid development.
About Caicloud and Solution for VIP.com
Highly Adaptive
Provided a deep-learning-based e-commerce tag system with thousands of labels, including product type, texture, color, fashion style, country style, point style, etc. The descriptive power of this AI tagging system has enabled VIP.com to adapt quickly to different markets.
Rapid Iteration
We have integrated a pipeline consisting of Azure Batch AI, Caicloud's agile AI DevOps product Clever and Caicloud's model library product Cabernet. By leveraging the scalable power of Batch AI, we can iterate quickly over the models.
Easy to Replicate
The solution has been packaged and containerized in Caicloud's Kubernetes product Compass. We can set up and shut down any part of the whole solution in seconds and scale out to handle a huge volume of traffic.
Global Foundation
The solution has been built on Azure Global and can manage the differences between geo-locations.
✓ 6 CMU alumni with Ph.D. and master's degrees in the core team
✓ Many former Google AI engineers
✓ 3 of 11 core approvers with write access in the Kubeflow open source project
✓ 2 best-selling books on TensorFlow in China
✓ Data scientists ranked in the top 1% on Kaggle
✓ Kubernetes & TensorFlow China community leader
✓ A Microsoft OCP (One Commercial Partner) managed ISV
Architecture: Tooling, Inference and Training
[Architecture diagram; labels recovered from the figure]
• Images / Text / Log / Times / Personal info.
• Recommendation Service: Sorting Algorithm / ML / Deep Learning / Special Rules
• Data Governance, Model Development
• TensorFlow: Deep Learning / ML
• Kubernetes / Resource Management
• Container: Application Centralization
• Schedule, Proxy, Data, Image, Nginx, Dashboard, CI/CD
• CPU, GPU
• API: Label, Model, Training Job, Monitor
• Job Management Toolkit, TensorFlow
• Compute Cluster Management: Low-Priority VMs, IB-Connected GPUs, Auto Scaling
• Azure Storage
• Run job inside containers, distributed across multiple nodes for scale
• Batch AI on Azure
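Running a job "distributed across multiple nodes for scale" usually means each worker handles its own slice of the input. The following is a minimal sketch of a strided split by worker rank; `shard_for_worker` and the file names are hypothetical, and in a real MPI job the rank and worker count would come from the launcher rather than a loop.

```python
from typing import List, Sequence


def shard_for_worker(items: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the slice of `items` that worker `rank` of `world_size` handles."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Strided split: worker r takes items r, r + world_size, r + 2 * world_size, ...
    return list(items[rank::world_size])


images = [f"img_{i:03d}.jpg" for i in range(10)]
shards = [shard_for_worker(images, r, 4) for r in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

A strided split keeps shard sizes within one item of each other and needs no coordination between workers, which suits an embarrassingly parallel task such as labeling images.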
Project Highlights
• VIP.com is able to drive its global expansion business 85% more efficiently by using Azure
• Train multiple models in parallel: Caicloud’s training time shortened from 2 months to 1 week
• Labeling one million images with 64 GPUs takes only 1% of the original on-premises processing time
• Collaboration across companies and regions
  o VIP.com (Customer) + Caicloud (Partner) + Microsoft
  o Microsoft China OCP (One Commercial Partner) and Redmond Batch AI Product Team
  o 3-day hackfest with Microsoft to deep dive on the solution
Empowered By
vip.com
Batch AI
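The figures quoted on this slide can be turned into speedup factors with simple arithmetic. The sketch below assumes a month is about 30 days, which the slide does not state; the per-GPU figure is only indicative, because the slide does not describe the original on-premises hardware.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of elapsed time before to elapsed time after."""
    return before / after


# "2 months to 1 week": assume a month is about 30 days (an assumption).
training_speedup = speedup(before=2 * 30, after=7)

# "1% of the original processing time" means 100x faster overall.
labeling_speedup = speedup(before=100.0, after=1.0)

# Share of that 100x per GPU across the 64 GPUs quoted on the slide.
per_gpu_factor = labeling_speedup / 64

print(round(training_speedup, 1), labeling_speedup, round(per_gpu_factor, 2))
```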
https://www.flickr.com/photos/mwichary/4358122657
https://github.com/saidbleik/batchai_mm_ad
https://github.com/Azure/doAzureParallel
https://github.com/Azure/aztk/
https://azure.github.io/batch-shipyard/
https://github.com/Azure/batch-shipyard
https://www.flickr.com/photos/mwichary/4148855707
Virtual Machines
CPU GPU FPGA
Databricks Batch AI AKS
Azure Machine Learning SDK
Your Applications and Pipelines
https://docs.microsoft.com/azure/batch-ai/
https://github.com/Azure/BatchAI
AzureBatchAITrainingPreview@service.microsoft.com
Deep learning at scale in Azure