Recent progress on
distributing deep learning
Viet-Trung Tran
KDE lab
Department of Information Systems
School of Information and Communication
Technology
1
Outline
•  State of the art
•  Overview of neural networks and deep learning
•  Factors driving deep learning
•  Scaling deep learning
2
Perceptron
7
Feed forward neural network
8
Training algorithm 
•  while not done yet
– pick a random training case (x, y)
– run the neural network on input x
– modify the connections to bring the prediction closer to y, i.e. follow the gradient of the error w.r.t. the connections (see the sketch below)
9
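A minimal sketch of this loop in Python/NumPy (illustrative only, not from the slides): a toy linear model with squared-error loss, where the random data, learning rate and stopping condition are all assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # hypothetical training inputs
y = X @ rng.normal(size=10)            # hypothetical targets

w, b = np.zeros(10), 0.0               # the "connections" to be modified
lr = 0.01                              # step size along the gradient

for step in range(10_000):             # "while not done yet"
    i = rng.integers(len(X))           # pick a random training case (x, y)
    x, t = X[i], y[i]
    pred = w @ x + b                   # run the network on input x
    err = pred - t                     # d(0.5 * err**2) / d(pred)
    w -= lr * err * x                  # follow the gradient of the error
    b -= lr * err                      #   w.r.t. the connections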
Parameter learning: backpropagation of error
•  Calculate total error at the top
•  Calculate contributions to the error at each step going backwards (see the sketch below)
10
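A hedged sketch of backpropagation for a tiny two-layer network (layer sizes, sigmoid activations and squared-error loss are assumptions, not taken from the slides): the error is computed at the top, and each layer's contribution is obtained going backwards via the chain rule.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # one input vector
t = np.array([1.0])                       # its target
W1 = rng.normal(size=(3, 4))              # input -> hidden weights
W2 = rng.normal(size=(1, 3))              # hidden -> output weights

# forward pass
h = sigmoid(W1 @ x)                       # hidden activations
y = sigmoid(W2 @ h)                       # network output

# calculate total error at the top (E = 0.5 * (y - t)^2)
delta2 = (y - t) * y * (1 - y)            # dE/dz at the output layer

# calculate contributions to the error going backwards
delta1 = (W2.T @ delta2) * h * (1 - h)    # dE/dz at the hidden layer

grad_W2 = np.outer(delta2, h)             # gradients w.r.t. the connections
grad_W1 = np.outer(delta1, x)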
Stochastic gradient descent
(SGD)
11
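For reference, the standard mini-batch SGD update in LaTeX (the notation \eta_t for the learning rate and B_t for the mini-batch drawn at step t is chosen here, not taken from the slides):

w_{t+1} = w_t - \eta_t \, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_w \, \ell(x_i, y_i; w_t)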
Fact
Anything humans can do in 0.1 sec, the right
big 10-layer network can do too
13
FACTORS DRIVING DEEP LEARNING
14
Big Data
15
Source: Eric P. Xing
Computing resources
16
"Modern" neural networks
•  Deeper models that are faster to train
– Deep belief networks
– ConvNets
– RNNs (LSTM, GRU)
17
SCALING DISTRIBUTED DEEP
LEARNING
18
Growing Model Complexity
19
Source: Eric P. Xing
Objective: minimizing time to
results
•  shorten experiment turnaround time
•  optimize for speed rather than for resource efficiency
20
Objective: improving results
•  Fact: increasing training examples, model
parameters, or both, can drastically improve
ultimate classification accuracy
– D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
– R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
21
Scaling deep learning 
•  Leverage GPUs
•  Exploit many kinds of parallelism
– Model parallelism
– Data parallelism
22
Why scale out?
•  We can use a cluster of machines to train a
modestly sized speech model to the same
classification accuracy in less than 1/10th
the time required on a GPU
23
Model parallelism
•  Parallelism in DistBelief
24
Model parallelism [cont'd]
•  Message passing during the upward and downward phases
•  Distributed computation
•  Performance gains are limited by communication costs (see the sketch below)
25
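An illustrative sketch of model parallelism (assumptions mine, not DistBelief code): one fully connected layer is partitioned by output neurons across two simulated workers; during the upward phase each worker computes only its slice of the activations, and the slices are then exchanged and concatenated, which is where the communication cost comes from.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)                  # activations from the layer below
W = rng.normal(size=(512, 256))           # full layer weights (512 outputs)

# partition the layer: worker 0 owns rows 0..255, worker 1 owns rows 256..511
W_shards = np.array_split(W, 2, axis=0)

# each worker computes its partial output locally...
partials = [shard @ x for shard in W_shards]

# ...and the slices are exchanged/concatenated before feeding the next layer
h = np.concatenate(partials)
assert np.allclose(h, W @ x)              # same result as on a single machine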
26
Source: Jeff Dean
Data parallelism: Downpour SGD
•  Divide the training data into a number of subsets
•  Run a copy of the model on each subset
•  Before processing each mini-batch, a model replica
–  asks the parameter servers for up-to-date parameters
–  processes the mini-batch
–  sends back the gradients
•  To reduce communication overhead
–  fetch parameters from the parameter servers only every nfetch steps and push gradients only every npush steps
•  As a result, a model replica is almost certainly working on slightly out-of-date parameters (see the pseudocode below)
27
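Hedged pseudocode for a Downpour-style replica loop; param_server, fetch(), push(), set_parameters() and compute_gradients() are hypothetical placeholders rather than the DistBelief API, and only the nfetch / npush scheduling follows the slide.

def downpour_replica(model, data_shard, param_server, nfetch=1, npush=1):
    acc = None
    for step, minibatch in enumerate(data_shard):
        if step % nfetch == 0:
            # ask the parameter servers for up-to-date parameters
            model.set_parameters(param_server.fetch())
        grads = model.compute_gradients(minibatch)    # process the mini-batch
        acc = grads if acc is None else acc + grads   # accumulate locally
        if step % npush == 0:
            # send the gradients back; the servers apply the update
            param_server.push(acc)
            acc = None
        # between fetches the replica keeps computing on slightly
        # out-of-date parameters -- asynchrony is accepted by design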
Sandblaster
•  Coordinator assigns each of
the N model replicas a small
portion of work, much smaller
than 1/Nth of the total size of
a batch
•  Assigns replicas new portions
whenever they are free
•  Schedules multiple copies of the outstanding portions and uses the result from whichever model replica finishes first (see the sketch below)
28
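An illustrative coordinator loop in the spirit of Sandblaster (names and data structures are assumptions, not the original implementation): work is cut into portions much smaller than 1/N of a batch, handed out to whichever replica is free, and unfinished portions are re-issued redundantly so that the first replica to finish wins.

def sandblaster_coordinator(portions, replicas):
    copies = {p: 0 for p in portions}     # how many replicas work on each portion
    results = {}
    while len(results) < len(portions):
        for r in replicas:
            if r.is_free():
                # fresh work first, then extra copies of outstanding portions
                remaining = [p for p in portions if p not in results]
                p = min(remaining, key=copies.get)
                r.assign(p)
                copies[p] += 1
        for p, grad in collect_finished(replicas):   # hypothetical helper
            results.setdefault(p, grad)              # first finisher wins
    return results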
AllReduce – Baidu DeepImage
2015
•  Each worker computes gradients and maintains a subset of the parameters
•  Every node fetches up-to-date parameters from all other nodes
•  Optimization: butterfly synchronization
– requires only log(N) steps
– a final step broadcasts the result (see the sketch below)
29
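A sketch of a butterfly (recursive-doubling) all-reduce over N = 2^k simulated workers: after log2(N) pairwise exchange steps every worker holds the full sum of the gradients. In a real system each step is a network exchange between partner nodes; here the "message" is just an array lookup for illustration.

import numpy as np

def butterfly_allreduce(grads):            # grads: one equal-shape array per worker
    n = len(grads)                         # assumed to be a power of two
    bufs = [g.copy() for g in grads]
    step = 1
    while step < n:                        # log2(n) steps
        new = [bufs[rank] + bufs[rank ^ step] for rank in range(n)]
        bufs, step = new, step * 2         # partners differ in exactly one bit
    return bufs                            # every rank now holds the same sum

parts = [np.full(4, float(r)) for r in range(8)]
assert np.allclose(butterfly_allreduce(parts)[0], sum(parts))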
Butterfly barrier
30
Distributed Hogwild
•  Used by Caffe
•  Each node maintains a local replica of all parameters
•  In each iteration, a node computes gradients and applies the update locally
•  Updates are exchanged periodically (see the sketch below)
31
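A hedged sketch of the Hogwild-style scheme above (placeholder names, not Caffe's actual code): each node keeps a full local copy of the parameters, applies its own gradient updates every iteration, and only periodically reconciles with the other nodes, here by simple parameter averaging.

import numpy as np

def hogwild_node(init_params, data_shard, peers, exchange_every=100, lr=0.01):
    local = init_params.copy()                   # local replica of all parameters
    for it, minibatch in enumerate(data_shard):
        grad = compute_gradients(local, minibatch)   # hypothetical helper
        local -= lr * grad                           # update locally, no locking
        if it % exchange_every == 0:
            # exchange updates periodically, e.g. by averaging with peers
            local = np.mean([local] + [p.get_params() for p in peers], axis=0)
    return local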
DISTRIBUTED DEEP LEARNING FRAMEWORKS
32
Parameter server [OSDI 2014]
33
Apache Singa [2015]
•  National University of Singapore
34
Petuum CMU [ACML 2015]
35
Stale Synchronous Parallel (SSP)
36
Structure-Aware Parallelization
(Strads engine)
37
TensorFlow
•  Computation expressed as a data flow graph
•  Distributed version has just been released (based on gRPC)
39
Deep learning on Spark
•  Deeplearning4j
•  Adatao/Arimo: scaling TensorFlow on Spark
•  Yahoo released CaffeOnSpark
•  Data parallelism
40
DEMO APPLICATIONS
41
Vietnamese OCR
•  Recognizes whole text lines rather than individual words or characters
•  Very good results with just a ~20 MB model and ~30 pages of training data
42
Vietnamese predictive text model 
•  ~ 20 MB plain text corpus 
•  Chú hoài linh đẹp trai. Chú hoài linh
•  Chào buổi sáng
•  chị hát hay wa!! nghe thick a. 
•  chị khởi my ơi e rất la hâm mộ
•  làm gì bây giờ khi
•  chú hoài linh thật đẹp zai và chú Trấn thành đẹp qá
•  chú hoài linh thật đẹp zai và chú Phánh
43
•  ~ 14 MB plain text corpus 
•  lịch sử ghi nhớ năm 1979
•  tại hội nghị, đồng chí Phạm Ngọc Thủy Võ Văn Kiệt
•  tại hội nghị, đồng chí Hồ Chí Minh nói
•  tại hội nghị, đồng chí Võ Nguyên Giáp và đồng chí Hồ Chí
Minh đã ngồi ở 
•  tại đại hội Đảng lần thứ nhất vào năm 1945,
•  Ngay từ những ngày đầu, Đúng như nhận xét của Giáo sư
Nguyễn Văn Linh
44
CONCLUSION
45
Principles of ML System Design
•  ACML 2015. How to Go Really Big in AI: Strategies &
Principles for Distributed Machine Learning
– How to distribute?
– How to bridge computation and communication?
– How to communicate?
– What to communicate?
46
Thank you!
47
48
How to Go Really Big in AI: Strategies & Principles for Distributed Machine Learning
