Large-Scale Training
with GPUs at Facebook
Aapo Kyrola
Distributed AI Team @ Facebook
Contents
1. Quick intro to the Caffe2 framework
2. Parallel Training: Async & Sync
3. Synchronous SGD with Caffe2 and GLOO
4. Case Study: How we trained ResNet-50 on ImageNet in just 1 hour
Deep Learning Frameworks by FB
Caffe2
Caffe2 is...
• A lightweight framework for deep learning / ML / ...
• Primarily designed for production use cases and large-scale training
• Speed and low footprint
• C++- and Python-based interfaces
• Supports deployment on multiple platforms
  • Linux, macOS, iOS, Android and Windows
  • IoT devices, Raspberry Pi, Tegra X1, ...
Computational graph
• Describes the model as a DAG of operators and blobs
• The Caffe2 runtime has no deep learning concepts → it just executes a DAG
• The DAG also covers loss functions, data reading, metrics, etc.
• Graph construction in Python, incl. auto-gradient (flexibility); graph description in Protobuf (portability) — see the sketch after the example graph
Graph Example
[Figure: training as a directed graph — DataReader, FC, CrossEntropy, FCGradient, WeightedSum, LearningRate and IterOp operators connected through blobs X, W, b, Y, Label, Loss, W_grad and b_grad.]
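Not from the slides, but a minimal sketch of what this construction looks like with the caffe2.python helpers (model_helper/brew, as in the public Caffe2 tutorials); the one-layer FC + softmax-loss model is assumed purely for illustration:

```python
# Minimal Caffe2 graph-construction sketch (assumed toy example, not from the talk).
# The Python helpers emit a protobuf DAG of operators and blobs, and
# AddGradientOperators auto-generates the gradient operators in the same DAG.
from caffe2.python import brew, model_helper, workspace
import numpy as np

model = model_helper.ModelHelper(name="toy_trainer")

# Forward pass: FC -> softmax + cross-entropy loss (as in the example graph).
pred = brew.fc(model, "X", "pred", dim_in=100, dim_out=10)
softmax, loss = model.SoftmaxWithLoss([pred, "label"], ["softmax", "loss"])

# Auto-gradient: injects FCGradient etc. into the graph.
model.AddGradientOperators([loss])

# Feed inputs, then run the graph; the runtime only sees a DAG of operators.
workspace.FeedBlob("X", np.random.randn(32, 100).astype(np.float32))
workspace.FeedBlob("label", np.random.randint(10, size=32).astype(np.int32))
workspace.RunNetOnce(model.param_init_net)
workspace.RunNetOnce(model.net)
print(workspace.FetchBlob("loss"))
```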
Parallel Training
Asynchronous and Synchronous SGD
Asynchronous SGD
• Parameters are updated by parallel workers on a “best-effort” basis, “in the background”.
• Various algorithms adjust learning to handle delayed updates, such as EASGD or Block Momentum.
• Parameter servers manage the parameters.
• Can be used for very large models that do not fit on one machine.
Synchronous SGD
• Workers synchronize (“all-reduce”) parameter gradients after each iteration.
• Models are always in sync.
• Mathematically the number of workers does not matter: the computation is a function of the total batch size only (see the sketch below).
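A small numerical illustration of that last point (not from the slides; plain NumPy with a least-squares loss assumed for simplicity): averaging per-worker gradients over k workers with per-worker batch n gives exactly the gradient over the single batch of k·n examples.

```python
# Assumed toy example: averaging gradients from k workers with per-worker
# batch n equals the gradient computed over the full batch of k*n examples.
import numpy as np

def grad(w, X, y):
    # Gradient of the mean squared error 0.5 * mean((Xw - y)^2) w.r.t. w.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
k, n, d = 4, 8, 5                       # workers, per-worker batch, features
X, y = rng.normal(size=(k * n, d)), rng.normal(size=k * n)
w = rng.normal(size=d)

# Synchronous SGD step: each worker computes a gradient on its shard,
# then an all-reduce averages the shards.
shard_grads = [grad(w, X[i*n:(i+1)*n], y[i*n:(i+1)*n]) for i in range(k)]
allreduced = np.mean(shard_grads, axis=0)

# Identical to a single worker with the full kn-sized mini-batch.
assert np.allclose(allreduced, grad(w, X, y))
print("sync SGD gradient depends only on the total batch, not on k")
```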
Async vs. Sync
+ Async can scale to very large clusters.
- Async requires tuning when runtime characteristics change.
+ The sync result is not affected by execution: it is a function of the total batch size only.
- Sync is harder to scale to large clusters.
GPUs are very fast, so we can use fewer servers for computation → Sync SGD can scale sufficiently.
Sync SGD with Caffe2 + GLOO
SyncSGD with Caffe2
• Simple interface: data_parallel_model (DPM) for both multi-GPU and multi-GPU-multi-host models (see the sketch below).
• DPM injects AllReduce and Broadcast operators into the graph.
• (The Caffe2 runtime does not know it is running in parallel – everything is based on operators.)
• Each worker runs the same code, same DAG, in parallel.
• AllReduce & Broadcast act as implicit barriers.
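A sketch of the DPM usage pattern, assuming the builder-function names and signatures from the public Caffe2 resnet50_trainer example; exact details may differ between Caffe2 versions.

```python
# Sketch of the data_parallel_model (DPM) usage pattern (assumed from the
# public Caffe2 examples; signatures may vary by version).
from caffe2.python import brew, data_parallel_model, model_helper, optimizer

def add_input(model):
    # Each GPU reads its own shard of the data (reader setup omitted here).
    pass

def forward_pass(model, loss_scale):
    # Build the per-GPU forward graph and return the losses.
    pred = brew.fc(model, "X", "pred", dim_in=100, dim_out=10)
    softmax, loss = model.SoftmaxWithLoss([pred, "label"], ["softmax", "loss"])
    return [loss]

def param_update(model):
    # By the time this runs, gradients have already been all-reduced.
    optimizer.build_sgd(model, base_learning_rate=0.1)

train_model = model_helper.ModelHelper(name="data_parallel_trainer")
data_parallel_model.Parallelize_GPU(
    train_model,
    input_builder_fun=add_input,
    forward_pass_builder_fun=forward_pass,
    param_update_builder_fun=param_update,
    devices=[0, 1, 2, 3],       # GPUs on this host
    # rendezvous=...            # add a rendezvous to go multi-host over Gloo
)
```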
SyncSGD with Caffe2: graph view
[Figure: AllReduce operators are injected after each gradient blob (conv1_w_grad, fc1_w_grad, ...) produced by ConvGradient/FCGradient, followed by the ParamUpdate operators.] Parameter updates execute in parallel with the backward pass.
GLOO
https://github.com/facebookincubator/gloo
• Library for very fast distributed reductions: AllReduce, Reduce, Broadcast, Allgather
• External library, with operators for Caffe2 and PyTorch
• “Mini-MPI”
• Uses NVIDIA’s NCCL for inter-GPU reductions
• TCP/IP and RDMA transports supported (usage sketch below)
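Not from the slides: one easy way to exercise Gloo is through PyTorch's torch.distributed with the "gloo" backend. A minimal sketch, assuming the launcher (e.g. torchrun) supplies rank and world size through the environment:

```python
# Minimal sketch (assumed illustration): an all-reduce over the Gloo backend
# via torch.distributed. Launch one copy per worker, e.g. with torchrun.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")   # rank/world size from launcher env
    rank = dist.get_rank()

    t = torch.full((4,), float(rank))         # each worker contributes its rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # Gloo performs the reduction
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```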
Case Study: ImageNet in 1hr
Goal (June 2017)
• Train ResNet-50 (a very widely used image classification architecture) on the ImageNet-1K dataset in less than an hour, to roughly state-of-the-art accuracy.
• On a single 8-GPU P100 server: ~1.5 days for 90 epochs.
• Why? (A) Faster training improves development iteration speed;
• (B) it enables training with extremely large datasets in reasonable time.
Challenges
• Accuracy
  • Very large mini-batch sizes were believed to hurt convergence
• Scale efficiently
  • Facebook uses commodity networking (i.e. no InfiniBand)
  • 32 x 8 NVIDIA P100 GPUs, “Big Basin” architecture (an open-sourced design)
Baseline: 8 GPUs
[Plot: training error % vs. epochs] kn = 256, η = 0.1, final error 23.60% ± 0.12
• k = #GPUs
• n = per-GPU batch size
• η = learning rate
• 256 = 8 x 32
32 x 8 GPUs: same Learning Rate
[Plot: training error % vs. epochs]
kn = 256, η = 0.1: 23.60% ± 0.12
kn = 8k, η = 0.1: 41.78% ± 0.10
8192 = 256 GPUs x 32 per-GPU batch
32 x 8 GPUs: Sqrt-scaling of LR?
[Plot: training error % vs. epochs]
kn = 256, η = 0.1: 23.60% ± 0.12
kn = 8k, η = 0.6: 26.28% ± 0.03 (sqrt scaling: η ≈ 0.1 · √32 ≈ 0.6)
32 x 8 GPUs: Linear Scaling of LR?
Linear Scaling + Constant LR Warmup
Rapid changes happen at the beginning of training → use a small LR for the first few epochs.
Linear Scaling + Gradual LR Warmup
Start from an LR of η and increase it by a constant amount at each iteration so that η̂ = kη after 5 epochs (see the sketch below).
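A small sketch of that schedule (not from the slides; the base LR, worker count, and iterations-per-epoch values are placeholder assumptions):

```python
# Sketch of the linear-scaling + gradual-warmup LR schedule (assumed values:
# base LR 0.1, k = 32 workers, 5 warmup epochs, ~156 iterations per epoch for
# ImageNet at a total batch size of 8192).
def learning_rate(iteration, base_lr=0.1, k=32, warmup_epochs=5, iters_per_epoch=156):
    target_lr = k * base_lr                      # linear scaling rule: eta_hat = k * eta
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration < warmup_iters:
        # Gradual warmup: a constant increment per iteration, so the LR
        # reaches k * base_lr exactly after 5 epochs.
        return base_lr + (target_lr - base_lr) * iteration / warmup_iters
    return target_lr                             # step decays later in training omitted

# LR at the first iteration, mid-warmup, and just after warmup ends.
for it in (0, 390, 780):
    print(it, round(learning_rate(it), 3))
```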
Linear LR scaling
In this case, we found an 8K mini-batch to be close to the maximum we could reach with the linear LR scaling technique.
More tricks in the paper.
Scaling Efficiently
Efficient All-Reduce
• ResNet-50: 25 million float parameters (~100 MB)
• Each iteration takes ~0.3 s, and the backward pass runs in parallel with the all-reduces → latency is not an issue.
• The halving-doubling algorithm by Thakur et al. provides optimal throughput (sketched below): 3x speedup vs. the “ring algorithm” for all-reduce on 32 servers.
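To make the communication pattern concrete, here is a single-process simulation of the recursive halving-doubling all-reduce (an assumed illustration, not the Gloo implementation): a reduce-scatter by recursive halving, then an allgather by recursive doubling.

```python
# Sketch: simulated recursive halving-doubling sum-allreduce (Thakur et al.).
# Phase 1 (reduce-scatter): ranks pair at distances p/2, p/4, ..., 1; the lower
# rank of each pair keeps and reduces the lower half of its current region.
# Phase 2 (allgather): the exchanges are replayed in reverse order to rebuild
# the full, fully reduced buffer on every rank.
import numpy as np

def halving_doubling_allreduce(bufs):
    """Sum-allreduce over equal-size 1-D arrays, one per simulated rank (in place)."""
    p, n = len(bufs), bufs[0].size
    assert (p & (p - 1)) == 0 and n % p == 0, "power-of-two ranks, divisible buffer"
    lo, hi = [0] * p, [n] * p                  # region each rank is reducing

    dist = p // 2                              # reduce-scatter: recursive halving
    while dist >= 1:
        snap = [b.copy() for b in bufs]        # simulate messages in flight
        for r in range(p):
            peer = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            lo[r], hi[r] = (lo[r], mid) if r < peer else (mid, hi[r])
            bufs[r][lo[r]:hi[r]] += snap[peer][lo[r]:hi[r]]
        dist //= 2

    dist = 1                                   # allgather: recursive doubling
    while dist < p:
        snap = [b.copy() for b in bufs]
        snap_lo, snap_hi = list(lo), list(hi)
        for r in range(p):
            peer = r ^ dist
            s, e = snap_lo[peer], snap_hi[peer]
            bufs[r][s:e] = snap[peer][s:e]
            lo[r], hi[r] = min(lo[r], s), max(hi[r], e)
        dist *= 2
    return bufs

# 8 simulated ranks; every rank ends with the element-wise sum.
ranks = [np.full(8, float(i)) for i in range(8)]
halving_doubling_allreduce(ranks)
assert all(np.allclose(b, sum(range(8))) for b in ranks)
```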
Using 100% commodity hardware and an open-source software stack: 90% scaling efficiency.
Follow-up Work by Others
• Already several follow-up papers reproduce & improve on our results.
• For example: You, Gitman and Ginsburg demonstrate batch sizes up to 32K (using a layer-wise adaptive learning rate).
• Alternatives to GPUs, such as Intel Xeon Phi.
On-going Work
• Elasticity: survive crashes; incrementally add nodes to the cluster when they become available.
• Data input is becoming a bottleneck.
• FP16 for training.
• Implement & experiment with asynchronous algorithms.
Lessons Learned
• SyncSGD can go a long way and has fewer tunable parameters than asynchronous SGD.
• The learning rate is the fundamental parameter when increasing the mini-batch size.
• Utilize the inherent parallelism in training to hide latency.
• Commodity hardware can go a long way.
Thank You!
Caffe2.ai