Recent progress on
distributing deep learning
Viet-Trung Tran
KDE lab
Department of Information Systems
School of Information and Communication
Technology
1
Outline
•  State of the art
•  Overview of neural networks and deep learning
•  Factors driving deep learning
•  Scaling deep learning
2
Perceptron
7
Feed forward neural network
8
Training algorithm 
•  while not done yet
– pick a random training case (x, y)
– run the neural network on input x
– modify the connections to bring the prediction closer to y, i.e. follow the gradient of the error w.r.t. the connections (see the sketch below)
9
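A minimal sketch of this loop in Python/NumPy (illustrative only, not from the slides): a toy linear model with squared-error loss, where the random data, learning rate and stopping condition are all assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # hypothetical training inputs
y = X @ rng.normal(size=10)            # hypothetical targets

w, b = np.zeros(10), 0.0               # the "connections" to be modified
lr = 0.01                              # step size along the gradient

for step in range(10_000):             # "while not done yet"
    i = rng.integers(len(X))           # pick a random training case (x, y)
    x, t = X[i], y[i]
    pred = w @ x + b                   # run the network on input x
    err = pred - t                     # d(0.5 * err**2) / d(pred)
    w -= lr * err * x                  # follow the gradient of the error
    b -= lr * err                      #   w.r.t. the connections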
Parameter learning: backpropagation of error
•  Calculate total error at the top
•  Calculate contributions to the error at each step going backwards (see the sketch below)
10
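A hedged sketch of backpropagation for a tiny two-layer network (layer sizes, sigmoid activations and squared-error loss are assumptions, not taken from the slides): the error is computed at the top, and each layer's contribution is obtained going backwards via the chain rule.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # one input vector
t = np.array([1.0])                       # its target
W1 = rng.normal(size=(3, 4))              # input -> hidden weights
W2 = rng.normal(size=(1, 3))              # hidden -> output weights

# forward pass
h = sigmoid(W1 @ x)                       # hidden activations
y = sigmoid(W2 @ h)                       # network output

# calculate total error at the top (E = 0.5 * (y - t)^2)
delta2 = (y - t) * y * (1 - y)            # dE/dz at the output layer

# calculate contributions to the error going backwards
delta1 = (W2.T @ delta2) * h * (1 - h)    # dE/dz at the hidden layer

grad_W2 = np.outer(delta2, h)             # gradients w.r.t. the connections
grad_W1 = np.outer(delta1, x)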
Stochastic gradient descent
(SGD)
11
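For reference, the standard mini-batch SGD update in LaTeX (the notation \eta_t for the learning rate and B_t for the mini-batch drawn at step t is chosen here, not taken from the slides):

w_{t+1} = w_t - \eta_t \, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_w \, \ell(x_i, y_i; w_t)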
Fact
Anything humans can do in 0.1 sec, the right
big 10-layer network can do too
13
FACTORS DRIVING DEEP LEARNING
14
Big Data
15
Source: Eric P. Xing
Computing resources
16
"Modern" neural networks
•  Deeper models that are faster to train
– Deep belief networks
– ConvNets
– RNNs (LSTM, GRU)
17
SCALING DISTRIBUTED DEEP
LEARNING
18
Growing Model Complexity
19
Source: Eric P. Xing
Objective: minimizing time to
results
•  shorten experiment turnaround time
•  optimize for speed rather than for resource efficiency
20
Objective: improving results
•  Fact: increasing training examples, model
parameters, or both, can drastically improve
ultimate classification accuracy
– D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
– R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
21
Scaling deep learning 
•  Leverage GPUs
•  Exploit many kinds of parallelism
– Model parallelism
– Data parallelism
22
Why scale out?
•  We can use a cluster of machines to train a
modestly sized speech model to the same
classification accuracy in less than 1/10th
the time required on a GPU
23
Model parallelism
•  Parallelism in DistBelief
24
Model parallelism [cont'd]
•  Message passing during the upward and downward phases
•  Distributed computation
•  Performance gains are limited by communication costs (see the sketch below)
25
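An illustrative sketch of model parallelism (assumptions mine, not DistBelief code): one fully connected layer is partitioned by output neurons across two simulated workers; during the upward phase each worker computes only its slice of the activations, and the slices are then exchanged and concatenated, which is where the communication cost comes from.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)                  # activations from the layer below
W = rng.normal(size=(512, 256))           # full layer weights (512 outputs)

# partition the layer: worker 0 owns rows 0..255, worker 1 owns rows 256..511
W_shards = np.array_split(W, 2, axis=0)

# each worker computes its partial output locally...
partials = [shard @ x for shard in W_shards]

# ...and the slices are exchanged/concatenated before feeding the next layer
h = np.concatenate(partials)
assert np.allclose(h, W @ x)              # same result as on a single machine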
26
Source: Jeff Dean
Data parallelism: Downpour SGD
•  Divide the training data into a number of subsets
•  Run a copy of the model on each subset
•  Before processing each mini-batch, a model replica
–  asks the parameter servers for up-to-date parameters
–  processes the mini-batch
–  sends back the gradients
•  To reduce communication overhead
–  fetch parameters from the parameter servers only every nfetch steps and push gradients only every npush steps
•  As a result, a model replica is almost certainly working on slightly out-of-date parameters (see the pseudocode below)
27
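Hedged pseudocode for a Downpour-style replica loop; param_server, fetch(), push(), set_parameters() and compute_gradients() are hypothetical placeholders rather than the DistBelief API, and only the nfetch / npush scheduling follows the slide.

def downpour_replica(model, data_shard, param_server, nfetch=1, npush=1):
    acc = None
    for step, minibatch in enumerate(data_shard):
        if step % nfetch == 0:
            # ask the parameter servers for up-to-date parameters
            model.set_parameters(param_server.fetch())
        grads = model.compute_gradients(minibatch)    # process the mini-batch
        acc = grads if acc is None else acc + grads   # accumulate locally
        if step % npush == 0:
            # send the gradients back; the servers apply the update
            param_server.push(acc)
            acc = None
        # between fetches the replica keeps computing on slightly
        # out-of-date parameters -- asynchrony is accepted by design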
Sandblaster
•  Coordinator assigns each of
the N model replicas a small
portion of work, much smaller
than 1/Nth of the total size of
a batch
•  Assigns replicas new portions
whenever they are free
•  Schedules multiple copies of the outstanding portions and uses the result from whichever model replica finishes first (see the sketch below)
28
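An illustrative coordinator loop in the spirit of Sandblaster (names and data structures are assumptions, not the original implementation): work is cut into portions much smaller than 1/N of a batch, handed out to whichever replica is free, and unfinished portions are re-issued redundantly so that the first replica to finish wins.

def sandblaster_coordinator(portions, replicas):
    copies = {p: 0 for p in portions}     # how many replicas work on each portion
    results = {}
    while len(results) < len(portions):
        for r in replicas:
            if r.is_free():
                # fresh work first, then extra copies of outstanding portions
                remaining = [p for p in portions if p not in results]
                p = min(remaining, key=copies.get)
                r.assign(p)
                copies[p] += 1
        for p, grad in collect_finished(replicas):   # hypothetical helper
            results.setdefault(p, grad)              # first finisher wins
    return results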
AllReduce – Baidu DeepImage
2015
•  Each worker computes gradients and maintains a subset of the parameters
•  Every node fetches up-to-date parameters from all other nodes
•  Optimization: butterfly synchronization
– requires only log(N) steps
– a final step broadcasts the result (see the sketch below)
29
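A sketch of a butterfly (recursive-doubling) all-reduce over N = 2^k simulated workers: after log2(N) pairwise exchange steps every worker holds the full sum of the gradients. In a real system each step is a network exchange between partner nodes; here the "message" is just an array lookup for illustration.

import numpy as np

def butterfly_allreduce(grads):            # grads: one equal-shape array per worker
    n = len(grads)                         # assumed to be a power of two
    bufs = [g.copy() for g in grads]
    step = 1
    while step < n:                        # log2(n) steps
        new = [bufs[rank] + bufs[rank ^ step] for rank in range(n)]
        bufs, step = new, step * 2         # partners differ in exactly one bit
    return bufs                            # every rank now holds the same sum

parts = [np.full(4, float(r)) for r in range(8)]
assert np.allclose(butterfly_allreduce(parts)[0], sum(parts))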
Butterfly barrier
30
Distributed Hogwild
•  Used by Caffe
•  Each node maintains a local replica of all parameters
•  In each iteration, a node computes gradients and applies the update locally
•  Updates are exchanged periodically (see the sketch below)
31
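A hedged sketch of the Hogwild-style scheme above (placeholder names, not Caffe's actual code): each node keeps a full local copy of the parameters, applies its own gradient updates every iteration, and only periodically reconciles with the other nodes, here by simple parameter averaging.

import numpy as np

def hogwild_node(init_params, data_shard, peers, exchange_every=100, lr=0.01):
    local = init_params.copy()                   # local replica of all parameters
    for it, minibatch in enumerate(data_shard):
        grad = compute_gradients(local, minibatch)   # hypothetical helper
        local -= lr * grad                           # update locally, no locking
        if it % exchange_every == 0:
            # exchange updates periodically, e.g. by averaging with peers
            local = np.mean([local] + [p.get_params() for p in peers], axis=0)
    return local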
DISTRIBUTED DEEP LEARNING FRAMEWORKS
32
Parameter server [OSDI 2014]
33
Apache Singa [2015]
•  National University of Singapore
34
Petuum CMU [ACML 2015]
35
Stale Synchronous Parallel (SSP)
36
Structure-Aware Parallelization
(Strads engine)
37
TensorFlow
•  Computation expressed as a data flow graph
•  Distributed version has just been released (based on gRPC)
39
Deep learning on Spark
•  Deeplearning4j
•  Adatao/Arimo: scaling TensorFlow on Spark
•  Yahoo released CaffeOnSpark
•  Data parallelism
40
DEMO APPLICATIONS
41
Vietnamese OCR
•  Recognizes whole text lines rather than individual words or characters
•  Very good results with just a ~20 MB model and ~30 pages of training data
42
Vietnamese predictive text model 
•  ~ 20 MB plain text corpus 
•  Chú hoài linh đẹp trai. Chú hoài linh
•  Chào buổi sáng
•  chị hát hay wa!! nghe thick a. 
•  chị khởi my ơi e rất la hâm mộ
•  làm gì bây giờ khi
•  chú hoài linh thật đẹp zai và chú Trấn thành đẹp qá
•  chú hoài linh thật đẹp zai và chú Phánh
43
•  ~ 14 MB plain text corpus 
•  lịch sử ghi nhớ năm 1979
•  tại hội nghị, đồng chí Phạm Ngọc Thủy Võ Văn Kiệt
•  tại hội nghị, đồng chí Hồ Chí Minh nói
•  tại hội nghị, đồng chí Võ Nguyên Giáp và đồng chí Hồ Chí
Minh đã ngồi ở 
•  tại đại hội Đảng lần thứ nhất vào năm 1945,
•  Ngay từ những ngày đầu, Đúng như nhận xét của Giáo sư
Nguyễn Văn Linh
44
CONCLUSION
45
Principles of ML System Design
•  ACML 2015. How to Go Really Big in AI: Strategies &
Principles for Distributed Machine Learning
– How to distribute?
– How to bridge computation and communication?
– How to communicate?
– What to communicate?
46
Thank you!
47
48
How to Go Really Big in AI: Strategies & Principles for Distributed Machine Learning
