Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Scaling Vision Models Using Caffe2 on AWS
Pieter Noordhuis | Caffe2 Engineering at Facebook
Joseph Spisak | Head of AI/ML Partnerships at AWS
November 29, 2017

Agenda
The AWS ML Strategy
Deep learning and GPU compute
The Caffe2 story
Amazon + Facebook = optimized Caffe2 on AWS
How to get started…
Key takeaways and call to action

ML @ AWS: Our mission
Enable customers (at all levels of expertise) to build machine learning-driven applications

ML in the Hands of Every Developer
Services
Platforms
Frameworks
Infrastructure

AWS ML Stack
Application Services
• Vision: Rekognition Image, Rekognition Video
• Speech: Amazon Polly, Transcribe
• Language: Amazon Lex, Translate, Comprehend
Platform Services
• Amazon SageMaker
• AWS DeepLens
• Amazon Machine Learning
• Mechanical Turk
• Spark & EMR
Frameworks & Infrastructure
• AWS Deep Learning AMI
• Frameworks: Apache MXNet, PyTorch, Cognitive Toolkit, Keras, Caffe2 & Caffe, TensorFlow, Gluon
• Infrastructure: GPU (P3 Instances), CPU, Mobile, IoT (Greengrass)

Amazon EC2 P3 Instances (October 2017)
The fastest, most powerful GPU instances in the cloud
• Up to eight NVIDIA Tesla V100 GPUs
• 1 PetaFLOP of computational performance – 14x better than P2
• 300 GB/s GPU-to-GPU communication (NVLink) – 9x better than P2
• 16 GB GPU memory with 900 GB/s peak GPU memory bandwidth

• Get started quickly with easy-to-launch tutorials
• Hassle-free setup and configuration
• Pay only for what you use – no additional charge for the AMI
• Accelerate your model training and deployment
• Support for popular deep learning frameworks

Open Neural Network Exchange (ONNX)
• Developers can choose the framework that best fits their needs
• More customers can take advantage of MXNet’s performance and scalability
• MXNet users can run their models on various mobile and edge devices (Qualcomm, Huawei, Intel, and ARM announced support for ONNX)

Caffe2

The original Caffe
• Grad student-driven project
• Focuses on CV applications
• Adopted by industry
• #2 DL framework in popularity

Caffe2 brings…
• Full computation graph
• First-class distributed support
• Cross-platform
• CV / NLP / speech / ranking

Cross-platform support
Caffe2 uses CMake and builds on:
• Linux / Mac
• Windows
• iOS
• Android
• Tegra K1/X2
• Raspberry Pi

Modularity
• NNPack
• cuDNN
• Metal
• Custom code

Modularity Enables
Multiple backend support
• cuDNN
• MKLDNN
• Metal
• Snapdragon NPE
Easy extensions
• caffe2/contrib/...
• Or a custom extension!
class MyTSNEOp : public Operator<CPUContext> {…};
REGISTER_OPERATOR(TSNE, MyTSNEOp);

Defining a model
• Model is container for all ops
• Model can convert to protobuf
• Argument scope for Brew API
• Brew API is set of factory functions
• Image input is an operator
• “reader” is an object that can:
  • seek(N)
  • read() → (data, label)
• Image augmentation
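
The reader contract on this slide (seek to a record, then read (data, label) pairs) can be illustrated with a minimal in-memory stand-in. This is a sketch only, not Caffe2's reader: the class name ListReader and its records argument are hypothetical, chosen to show just the two calls named above.

```python
class ListReader:
    """Hypothetical in-memory reader exposing seek(n) and read()."""

    def __init__(self, records):
        # records: iterable of (data, label) pairs
        self._records = list(records)
        self._pos = 0

    def seek(self, n):
        """Move the cursor to record n (0-based)."""
        if not 0 <= n <= len(self._records):
            raise IndexError("seek position out of range")
        self._pos = n

    def read(self):
        """Return the next (data, label) pair and advance the cursor."""
        if self._pos >= len(self._records):
            raise EOFError("no more records")
        data, label = self._records[self._pos]
        self._pos += 1
        return data, label


reader = ListReader([("img0", 3), ("img1", 7), ("img2", 1)])
reader.seek(1)          # skip the first record
first = reader.read()   # ("img1", 7)
```

A seek call of this kind is one plausible way to let each worker start at a different offset of the same dataset.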

Defining a model
• Helper function to define ResNet-50
• “data” comes from image input
• If label is specified → softmax & loss
• Add operators to compute derivative w.r.t. loss
• Add optimizer to translate gradients into model weight updates
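
The last two bullets (compute the derivative of the loss, then let an optimizer turn gradients into weight updates) can be sketched in plain Python on a one-weight model. This is an illustration of the idea, not the Caffe2 API: the functions loss_and_grad and sgd_step, the squared-error loss, and the learning rate are all assumptions.

```python
def loss_and_grad(w, x, y):
    """Squared-error loss of the model pred = w * x, and d(loss)/dw."""
    pred = w * x
    loss = (pred - y) ** 2
    grad = 2.0 * (pred - y) * x
    return loss, grad


def sgd_step(w, grad, lr):
    """Plain SGD: translate a gradient into a weight update."""
    return w - lr * grad


# Minimal training loop on a single example (x=2, y=4); the ideal weight is 2.
w = 0.0
for _ in range(50):
    loss, grad = loss_and_grad(w, x=2.0, y=4.0)
    w = sgd_step(w, grad, lr=0.1)
```

In a real framework the gradient operators are generated automatically from the forward graph; only the update rule is chosen by the user.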

Training loop

Scaling up
• Weak scaling vs strong scaling
• Here we focus on weak scaling
• Data vs model parallelism
• Here we focus on data parallelism
• Modes:
• Single machine / single GPU
• Single machine / multi GPU
• Multi machine / single GPU
• Multi machine / multi GPU
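
The difference between weak and strong scaling comes down to which batch size is held fixed. A small arithmetic sketch, with hypothetical function names and example numbers:

```python
def weak_scaling_global_batch(per_gpu_batch, num_gpus):
    """Weak scaling: per-GPU batch stays fixed, global batch grows."""
    return per_gpu_batch * num_gpus


def strong_scaling_per_gpu_batch(global_batch, num_gpus):
    """Strong scaling: global batch stays fixed, per-GPU batch shrinks."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // num_gpus


# Example numbers (assumed): 32 images per GPU on 8 GPUs.
global_batch = weak_scaling_global_batch(32, 8)
per_gpu_batch = strong_scaling_per_gpu_batch(256, 8)
```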

Inefficiencies of small batches
[Chart: GPU throughput per batch size; x-axis: batch size (0 to 64), y-axis: approx. images/sec (0 to 450)]

Parallelizing the model
• Instantiates 1 model per device
• Batch size multiplied by len(devices)
• Gradients are reduced (averaged) before applying weight updates
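
Why averaging per-device gradients is safe can be checked numerically: with equal shard sizes, the average of the per-device mean gradients equals the mean gradient over the whole global batch. The sketch below uses a one-weight squared-error model; the function names and data are hypothetical and no Caffe2 call is involved.

```python
def mean_gradient(w, batch):
    """Mean of d/dw (w*x - y)^2 over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, global_batch, num_devices):
    """Split the batch evenly, compute one gradient per device, average."""
    if len(global_batch) % num_devices != 0:
        raise ValueError("batch must divide evenly across devices")
    shard = len(global_batch) // num_devices
    shards = [global_batch[i * shard:(i + 1) * shard]
              for i in range(num_devices)]
    per_device = [mean_gradient(w, s) for s in shards]  # one model replica each
    return sum(per_device) / num_devices                # the "reduce" step


batch = [(float(i), 3.0 * i) for i in range(1, 9)]  # 8 samples
g_parallel = data_parallel_gradient(0.5, batch, num_devices=4)
g_single = mean_gradient(0.5, batch)
```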

Forward: L1 L2 L3 | Backward: L3b L2b L1b | Update: U3 U2 U1

Two parallel timelines, each:
Forward: L1 L2 L3 | Backward: L3b L2b L1b | Reduce: R3 R2 R1 | Update: U3 U2 U1

Two parallel timelines, each:
Forward: L1 L2 L3 | Backward: L3b L2b L1b
Reduce and update (drawn on a second row): R3 R2 R1, U3 U2 U1
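
The benefit of moving the reduction onto its own row can be estimated with a small timeline model. This is a sketch under stated assumptions: per-layer times are made-up numbers in milliseconds, update time is ignored, and a layer's reduction may start as soon as its gradient is ready and the network is free.

```python
def sequential_time(fwd, bwd, red):
    """Forward, then backward, then all reductions, one after another."""
    return sum(fwd) + sum(bwd) + sum(red)


def overlapped_time(fwd, bwd, red):
    """Reductions run concurrently with the remaining backward pass.

    bwd and red are listed in backward order (last layer first).
    """
    t = sum(fwd)          # time at which the backward pass starts
    net_free = t          # time at which the network becomes idle
    for b, r in zip(bwd, red):
        t += b                              # this layer's gradient is ready
        net_free = max(net_free, t) + r     # reduce it once the network is free
    return max(t, net_free)


fwd = [10, 10, 10]   # L1, L2, L3
bwd = [20, 20, 20]   # L3b, L2b, L1b
red = [15, 15, 15]   # R3, R2, R1
t_seq = sequential_time(fwd, bwd, red)
t_ovl = overlapped_time(fwd, bwd, red)
```

With these numbers only the last reduction remains exposed; every earlier one is hidden behind backward computation.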

• Near-linear scaling (e.g. ~98%)
• Depends on efficiency of gradient averaging
• Every other operator executes locally
• Faster GPUs mean more pressure on image input
• Throughput on p3.16xlarge: ~2600 images/sec

Parallelizing for multi-machine
• Key/value store for rendezvous
• TCP for cross-machine reduction
• Same “Parallelize” call

Multi-machine
• Scheduling (depends on your environment):
• Amazon EC2 Container Service
• SLURM for MPI on HPC clusters
• Person typing SSH commands
• Etcetera…
• Rendezvous:
• Let instances find each other once started
• Use key/value store, e.g. Redis, Amazon ElastiCache

Multi-machine rendezvous
$ python trainer.py -n=2 -id=0
# Set address at 0/1
# Wait for address at 1/0
# Get address at 1/0
# Byte-compare sockaddr
# If <: close() and connect()
# If >: accept() and close()
# Socket is connected!
$ python trainer.py -n=2 -id=1
# Set address at 1/0
# Wait for address at 0/1
# Get address at 0/1
# Byte-compare sockaddr
# If <: close() and connect()
# If >: accept() and close()
# Socket is connected!
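
The rendezvous steps above can be sketched without real sockets. A plain dict stands in for the key/value store (Redis or ElastiCache in practice), byte strings stand in for socket addresses, and the helper names publish, lookup and choose_role are hypothetical. The point is only the key layout ("me/peer") and the tie-break: comparing the two addresses gives both sides a consistent answer about who connects and who accepts.

```python
def publish(store, me, peer, address):
    """Instance `me` sets its own address under the key "<me>/<peer>"."""
    store[f"{me}/{peer}"] = address


def lookup(store, me, peer):
    """Instance `me` gets what `peer` published under "<peer>/<me>".

    Returns None if the peer has not published yet (caller keeps waiting).
    """
    return store.get(f"{peer}/{me}")


def choose_role(my_address, peer_address):
    """Byte-compare the addresses; the smaller one connects."""
    if my_address == peer_address:
        raise ValueError("addresses must differ")
    return "connect" if my_address < peer_address else "accept"


store = {}                   # stand-in for the shared key/value store
addr0 = b"10.0.0.1:5000"     # example addresses (assumed)
addr1 = b"10.0.0.2:5000"
publish(store, 0, 1, addr0)
publish(store, 1, 0, addr1)
role0 = choose_role(addr0, lookup(store, 0, 1))
role1 = choose_role(addr1, lookup(store, 1, 0))
```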

Multi-machine reduction
• allreduce in 3 stages:
• Local reduce (reduce from all GPUs to system memory)
• Allreduce across machines
• Local broadcast (broadcast from system memory to all GPUs)
• Single allreduce per buffer in the model
• Runtime depends on size of the buffer and network speed
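
The three stages can be simulated on nested lists to make the data flow concrete. This is a sketch, not the Caffe2 implementation: the function name is hypothetical, buffers are plain lists of floats, and the result is a sum (dividing by the number of GPUs to obtain an average would be a separate step).

```python
def hierarchical_allreduce(grads):
    """grads[m][g] is the gradient buffer on GPU g of machine m."""
    # Stage 1: local reduce, all GPUs of a machine into system memory.
    host = [[sum(vals) for vals in zip(*machine)] for machine in grads]
    # Stage 2: allreduce across machines (here: a simple elementwise sum).
    total = [sum(vals) for vals in zip(*host)]
    # Stage 3: local broadcast, system memory back to every GPU.
    return [[list(total) for _ in machine] for machine in grads]


# Two machines with two GPUs each, buffer of two elements.
grads = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
result = hierarchical_allreduce(grads)
```

Only stage 2 touches the network, which is why its cost depends on buffer size and network speed.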

Multi-machine reduction
[Chart: Parameter size distribution for ResNet-50 (N=214); x-axis: parameter size (power of 2, from 6 to 22); y-axis ticks from 0 to 60; series: < 4k and > 4k]

[Chart: scaling efficiency (0 to 1) vs. allreduce time (100 to 400 milliseconds)]
scaling efficiency = (Tf + Tb) / (Tf + max(Tb, M/B))
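
The formula can be evaluated directly. In this sketch Tf and Tb are read as forward and backward time, M as the number of bytes to reduce, and B as network bandwidth; that reading of the symbols, the function name, and the example numbers are assumptions.

```python
def scaling_efficiency(t_forward, t_backward, model_bytes, bandwidth):
    """(Tf + Tb) / (Tf + max(Tb, M/B)), with times in seconds.

    model_bytes / bandwidth is the time needed to reduce the gradients.
    """
    t_reduce = model_bytes / bandwidth
    return (t_forward + t_backward) / (t_forward + max(t_backward, t_reduce))


# Assumed example: 0.1 s forward, 0.2 s backward, 100 MB of gradients.
fast_net = scaling_efficiency(0.1, 0.2, 100e6, 1e9)     # reduce takes 0.1 s
slow_net = scaling_efficiency(0.1, 0.2, 100e6, 0.25e9)  # reduce takes 0.4 s
```

On the fast network the reduction fits inside the backward pass and efficiency is 1; on the slow network it drops to about 0.6.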

Multi-machine observations
• Single slow machine slows down entire collective
• Multi-machine scales well only if:
• Time spent reducing gradients is <= time spent in backward pass
• Larger model size means more time on the network
• Larger global batch size requires tuning
• Current state of the art is a batch size of 32k on the ImageNet dataset (by Preferred Networks; https://arxiv.org/pdf/1711.04325.pdf)

Caffe2 on AWS

Caffe2 on AWS
• Amazon Deep Learning AMI
• CUDA 9 / cuDNN 7 / NCCL 2
• Amazon ElastiCache for VM rendezvous
• VPC for private network
• (optional) EFS for storing checkpoints

Caffe2 on AWS
• Use Caffe2 installed on AMI (stable)
• Use Caffe2 Docker image (stable & nightly)
• nvidia-docker for execution

Caffe2 on AWS
• Data input is an open problem
• Small datasets in RAM or local disk
• Larger datasets off of network (e.g. S3)

Getting Started

Key takeaways & call to action
Amazon AI is well-optimized to support Caffe2 and many other frameworks
The P3 Instance brings a leap forward in performance for deep learning
Distributed deep learning with Caffe2 enables large-scale training in hours instead of days/weeks.
Call to action:
Get started with Caffe2: http://caffe2.ai/
Use the Deep Learning AMI: https://aws.amazon.com/amazon-ai/amis/

The Deep Learning Revolution
Terrence Sejnowski, The Salk Institute for Biological Studies
Eye, Robot: Computer Vision and Autonomous Robotics
Aaron Ames & Pietro Perona, California Institute of Technology
Exploiting the Power of Language
Alexander Smola, Amazon Web Services
Reducing Supervision: Making More with Less
Martial Hebert, Carnegie Mellon University
Learning Where to Look in Video
Kristen Grauman, University of Texas
Look, Listen, Learn: The Intersection of Vision and Sound
Antonio Torralba, MIT
Investing in the Deep Learning Future
Matt Ocko, Data Collective Venture Capital
Thursday, November 30th
1:00 - 5:00pm | Venetian, Ballroom F
https://reinvent.awsevents.com/learn/deep-learning-summit/

Reminder: please fill out your surveys

Thank you!