Implementing Deep Learning using cuDNN
이예하

VUNO Inc.
1
Contents
1. Deep Learning Review

2. Implementation on GPU using cuDNN

3. Optimization Issues

4. Introduction to VUNO-Net
2
1. Deep Learning Review
3
Brief History of Neural Network
(Timeline, 1940s to 2000s, spanning the Golden Age and the Dark Age, the "AI Winter")
• 1943 Electronic Brain, S. McCulloch & W. Pitts: adjustable weights, but weights are not learned
• 1957 Perceptron, F. Rosenblatt: learnable weights and threshold
• 1960 ADALINE, B. Widrow & M. Hoff
• 1969 XOR Problem, M. Minsky & S. Papert: the XOR problem (a limitation of single-layer perceptrons)
• 1986 Multi-layered Perceptron (Backpropagation), D. Rumelhart, G. Hinton & R. Williams: forward activity / backward error; a solution to nonlinearly separable problems, but big computation, local optima and overfitting
• 1995 SVM, V. Vapnik & C. Cortes: limitations of learning prior knowledge; kernel functions require human intervention
• 2006 Deep Neural Network (Pretraining), G. Hinton & R. Salakhutdinov: hierarchical feature learning
4
Machine/Deep Learning is eating the world!
5
Building Blocks
• Restricted Boltzmann machine
• Auto-encoder
• Deep belief network
• Deep Boltzmann machines
• Generative stochastic networks
• Convolutional neural networks
• Recurrent neural networks
6
Convolutional Neural Networks
• Receptive Field (Hubel & Wiesel, 1959)
• Neocognitron (Fukushima, 1980)
7
Convolutional Neural Networks
• LeNet-5 (Yann LeCun, 1998)
8
Convolutional Neural Networks
• AlexNet (Alex Krizhevsky et al., 2012)
• Clarifai (Matt Zeiler and Rob Fergus, 2013)
9
ImageNet Challenge
• Detection/Classification and Localization
• 1.2M images for training / 50K for validation
• 1000 classes
10
Team              Error rate   Remark
Human             5.1%         Karpathy
Google (2014/8)   6.67%        Inception
Baidu (2015/1)    5.98%        Data Augmentation
MS (2015/2)       4.94%        SPP, PReLU, Weight Initialization
Google (2015/2)   4.82%        Batch Normalization
Baidu (2015/5)    4.58%        Cheating
Convolutional Neural Networks
Network
• Softmax Layer (Output)

• Fully Connected Layer

• Pooling Layer

• Convolution Layer
Layer
• Input / Output

• Weights

• Neuron activation
VGG (K. Simonyan and A. Zisserman, 2015)
11
Neural Network
12
Forward Pass: Conv Layer → FC Layer → Softmax Layer
Backward Pass: Softmax Layer → FC Layer → Conv Layer
Convolution Layer - Forward
(Diagram: a 3x3 input x1-x9 is convolved with a 2x2 filter w1-w4 to produce a 2x2 output y1-y4.)
13
Convolution Layer - Forward
How can the convolution layer be evaluated efficiently? There are several ways to implement convolutions:
1. Lower the convolution into a matrix multiplication (cuDNN) (see the sketch below)
2. Use the Fast Fourier Transform to compute the convolution (cuDNN v3)
3. Compute the convolution directly (cuda-convnet)
14
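To make the lowering idea concrete, here is a minimal CPU-side sketch (illustrative only, not cuDNN's actual implementation) that unrolls a single-channel input with stride 1 and no padding into an im2col matrix, so the convolution becomes one matrix multiplication:

// Each output position becomes a column holding the kH*kW input values under the filter,
// so the convolution reduces to: output[K x (outH*outW)] = filters[K x (kH*kW)] * col.
void im2col(const float* in, int H, int W, int kH, int kW, float* col)
{
    int outH = H - kH + 1, outW = W - kW + 1;
    for (int oy = 0; oy < outH; ++oy)
        for (int ox = 0; ox < outW; ++ox)
            for (int fy = 0; fy < kH; ++fy)
                for (int fx = 0; fx < kW; ++fx)
                    col[(fy * kW + fx) * (outH * outW) + oy * outW + ox] =
                        in[(oy + fy) * W + (ox + fx)];
}

On the GPU, the resulting matrix multiplication is handed to a highly tuned GEMM, which is why lowering is competitive in practice.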
Fully Connected Layer - Forward
15
(Diagram: inputs x1, x2, x3 fully connected to outputs y1, y2 through weights w11-w23.)
• Matrix calculation is very fast on GPU

• cuBLAS library
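As a rough illustration of how the fully connected forward pass maps onto cuBLAS, the sketch below computes Y = W X for a mini-batch with cublasSgemm; the dimension names (outCnt, inCnt, batch) and the column-major layout are assumptions for this example, and error checking is omitted.

#include <cublas_v2.h>

// Y (outCnt x batch) = W (outCnt x inCnt) * X (inCnt x batch), column-major as cuBLAS expects.
// d_W, d_X, d_Y are device pointers.
void fcForward(cublasHandle_t handle, const float* d_W, const float* d_X, float* d_Y,
               int outCnt, int inCnt, int batch)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                outCnt, batch, inCnt,
                &alpha, d_W, outCnt,   // lda = rows of W
                        d_X, inCnt,    // ldb = rows of X
                &beta,  d_Y, outCnt);  // ldc = rows of Y
}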
Softmax Layer - Forward
16
(Diagram: softmax over inputs x1, x2 producing outputs y1, y2.)
Issues
1. How to efficiently evaluate the denominator?
2. Numerical instability when x_k is too large or too small
Fortunately, the softmax layer is supported by cuDNN (see the sketch below)
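Both issues are handled inside cudnnSoftmaxForward, which uses the standard max-subtraction trick before exponentiating. A minimal call sketch, reusing the handle and descriptors declared on the later "Preliminary" slide (the ACCURATE algorithm and CHANNEL mode are example choices, assuming the class scores live in the channel dimension):

const float alpha = 1.0f, beta = 0.0f;
// CUDNN_SOFTMAX_ACCURATE subtracts max(x) per sample to avoid overflow in exp().
cudnnSoftmaxForward(cudnn_handle,
                    CUDNN_SOFTMAX_ACCURATE,
                    CUDNN_SOFTMAX_MODE_CHANNEL,
                    &alpha, inputDesc,  d_input,
                    &beta,  outputDesc, d_output);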
Learning on Neural Network
Update the network (i.e. its weights) to minimize a loss function
• Popular loss functions for classification: sum-of-squared error, cross-entropy error
• Necessary criterion: the gradient of the loss with respect to the weights is zero
17
Learning on Neural Network
• Due to the complexity of the loss, a closed-form solution is usually not possible
• Use an iterative approach, e.g. gradient descent or Newton's method
• The gradient is evaluated by error backpropagation
18
Softmax Layer - Backward
(Diagram: softmax outputs y1, y2 over inputs x1, x2.)
Loss function: cross-entropy
• The error back-propagated from the softmax layer is simply the output minus the one-hot target:
  subtract 1 from the output value at the target index
• This can be done efficiently using GPU threads (see the sketch below)
19
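A minimal CUDA kernel sketch of this "subtract 1 at the target index" step, assuming the softmax output has already been copied into the delta buffer; the names d_delta and d_labels are illustrative.

// One thread per sample: the softmax/cross-entropy gradient is y_k - t_k,
// i.e. the softmax output with 1 subtracted at the target class.
__global__ void softmaxBackward(float* d_delta, const int* d_labels,
                                int classCnt, int sampleCnt)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < sampleCnt)
        d_delta[s * classCnt + d_labels[s]] -= 1.0f;
}

// launch: softmaxBackward<<<(sampleCnt + 255) / 256, 256>>>(d_delta, d_labels, classCnt, sampleCnt);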
Fully Connected Layer - Backward
20
(Diagram: inputs x1, x2, x3 fully connected to outputs y1, y2 through weights wij.)
Forward pass, error and gradient are all expressed as matrix operations:
• Matrix calculation is very fast on GPU
• The element-wise multiplication by the activation derivative can be done efficiently using GPU threads
Convolution Layer - Backward
(Diagram: forward pass as on slide 13; the error is back-propagated by convolving the output error y1-y4 with the filter rotated by 180 degrees, i.e. w4, w3, w2, w1.)
21
Convolution Layer - Backward
22
(Diagram: the filter gradient is obtained by convolving the input x1-x9 with the back-propagated output error.)
• The error and the gradient can both be computed with a convolution scheme
• cuDNN supports these operations
2. Implementation on GPU using cuDNN
23
Introduction to cuDNN
cuDNN is a GPU-accelerated library of primitives for deep
neural networks
• Convolution forward and backward
• Pooling forward and backward
• Softmax forward and backward
• Neuron activations forward and backward:
• Rectified linear (ReLU)
• Sigmoid
• Hyperbolic tangent (TANH)
• Tensor transformation functions
24
Introduction to cuDNN (version 2)
cuDNN's convolution routines aim for performance
competitive with the fastest GEMM
• Lowering the convolutions into a matrix multiplication (Sharan Chetlur et al., 2015)
25
Introduction to cuDNN
Benchmarks
26
https://github.com/soumith/convnet-benchmarks
https://developer.nvidia.com/cudnn
Goal
Learning VGG model using cuDNN
• Data Layer
• Convolution Layer
• Pooling Layer
• Fully Connected Layer
• Softmax Layer
27
Preliminary
Common data structure for Layer
• Device memory & tensor description for input/output data & error
• A tensor descriptor defines the dimensions of the data
28
cuDNN initialize & release
cudaSetDevice( deviceID );

cudnnCreate( &cudnn_handle );
cudnnDestroy( cudnn_handle );

cudaDeviceReset();
float *d_input, *d_output, *d_inputDelta, *d_outputDelta;

cudnnTensorDescriptor_t inputDesc;

cudnnTensorDescriptor_t outputDesc;
Data Layer
29
create & set Tensor Descriptor
cudnnCreateTensorDescriptor(&outputDesc);
cudnnSetTensor4dDescriptor(
    outputDesc,
    CUDNN_TENSOR_NCHW,
    CUDNN_DATA_FLOAT,
    sampleCnt,   // N
    channels,    // C
    height,      // H
    width        // W
);
Example: 2 images of size 3x3 with 2 channels (NCHW layout)
• sample #1: channel #1 holds values 1-9, channel #2 holds values 10-18
• sample #2: channel #1 holds values 19-27, channel #2 holds values 28-36
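To make the NCHW layout explicit, a tiny helper (a general illustration, not part of the cuDNN API) shows how an (n, c, h, w) index maps to the flat memory offset of a fully packed tensor:

// Flat offset of element (n, c, h, w) in a fully packed NCHW tensor.
// For the example above (C = 2, H = 3, W = 3): sample 0, channel 1, position (0, 0)
// maps to offset 9, which holds the value 10.
inline int nchwOffset(int n, int c, int h, int w, int C, int H, int W)
{
    return ((n * C + c) * H + h) * W + w;
}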
Convolution Layer
1. Initialization
30
1.1 Create & set filter descriptor:
    cudnnCreateFilterDescriptor(&filterDesc);
    cudnnSetFilter4dDescriptor(…);
1.2 Create & set convolution descriptor:
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(…);
1.3 Create & set output tensor descriptor:
    cudnnGetConvolution2dForwardOutputDim(…);
    cudnnCreateTensorDescriptor(&dstTensorDesc);
    cudnnSetTensor4dDescriptor(…);
1.4 Get convolution algorithm:
    cudnnGetConvolutionForwardAlgorithm(…);
    cudnnGetConvolutionForwardWorkspaceSize(…);
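A condensed sketch of this initialization sequence using the cuDNN v2-era signatures; the padding and stride values (1, 1) are example choices, and status checks are omitted.

// 1.1 filter descriptor: filterCnt output maps, input_channelCnt channels, filter_height x filter_width
cudnnCreateFilterDescriptor(&filterDesc);
cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_FLOAT,
                           filterCnt, input_channelCnt, filter_height, filter_width);

// 1.2 convolution descriptor: pad 1, stride 1, no upscaling, cross-correlation mode
cudnnCreateConvolutionDescriptor(&convDesc);
cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1, CUDNN_CROSS_CORRELATION);

// 1.3 query the output shape and describe the output tensor
int n, c, h, w;
cudnnGetConvolution2dForwardOutputDim(convDesc, inputDesc, filterDesc, &n, &c, &h, &w);
cudnnCreateTensorDescriptor(&outputDesc);
cudnnSetTensor4dDescriptor(outputDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

// 1.4 pick the fastest forward algorithm and size its workspace
cudnnConvolutionFwdAlgo_t algo;
size_t workspaceSize = 0;
cudnnGetConvolutionForwardAlgorithm(cudnn_handle, inputDesc, filterDesc, convDesc, outputDesc,
                                    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST, 0, &algo);
cudnnGetConvolutionForwardWorkspaceSize(cudnn_handle, inputDesc, filterDesc, convDesc, outputDesc,
                                        algo, &workspaceSize);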
Convolution Layer
1.1 Set Filter Descriptor
31
cudnnSetFilter4dDescriptor(
    filterDesc,
    CUDNN_DATA_FLOAT,
    filterCnt,           // K: number of filters (output feature maps)
    input_channelCnt,    // C: input channels
    filter_height,       // H
    filter_width         // W
);

Example: 2 filters of size 2x2 with 2 channels
• Filter #1: channel #1 holds values 1-4, channel #2 holds values 5-8
• Filter #2: channel #1 holds values 9-12, channel #2 holds values 13-16
Convolution Layer
1.2 Set Convolution Descriptor
32
The convolution mode is set to CUDNN_CROSS_CORRELATION (the filter is applied without flipping), along with the padding and stride, via cudnnSetConvolution2dDescriptor(…).
Convolution Layer
1.3 Set output Tensor Descriptor
33
• n, c, h and w returned by cudnnGetConvolution2dForwardOutputDim(…) give the output dimensions
• They are used to set the output tensor descriptor
Convolution Layer
1.4 Get Convolution Algorithm
34
cudnnGetConvolutionForwardAlgorithm(…) takes inputDesc, filterDesc, convDesc and outputDesc and, with the CUDNN_CONVOLUTION_FWD_PREFER_FASTEST preference, returns the fastest available algorithm; cudnnGetConvolutionForwardWorkspaceSize(…) then reports how much workspace that algorithm needs.
Convolution Layer
2. Forward Pass
2.1 Convolution
2.2 Activation
cudnnConvolutionForward(…);
cudnnActivationForward(…);
35
Convolution Layer
2.1 Convolution
36
cudnnConvolutionForward(…) reads d_input (described by inputDesc) and the filter data, and writes the result to d_output (described by outputDesc), using the algorithm and workspace obtained during initialization.
Convolution Layer
2.2 Activation
37
cudnnActivationForward(…) applies the chosen activation (sigmoid, tanh, or ReLU) to d_output (described by outputDesc), typically in place.
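Putting 2.1 and 2.2 together, a minimal forward-pass sketch under the same assumptions as the initialization sketch above (v2-era signatures; d_filter and d_workspace are assumed to hold the filter weights and a workspace of workspaceSize bytes; ReLU is the example activation):

const float alpha = 1.0f, beta = 0.0f;

// 2.1 convolution: d_output = conv(d_input, d_filter)
cudnnConvolutionForward(cudnn_handle, &alpha,
                        inputDesc,  d_input,
                        filterDesc, d_filter,
                        convDesc, algo, d_workspace, workspaceSize,
                        &beta, outputDesc, d_output);

// 2.2 activation applied in place on the convolution output
cudnnActivationForward(cudnn_handle, CUDNN_ACTIVATION_RELU, &alpha,
                       outputDesc, d_output,
                       &beta, outputDesc, d_output);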
Convolution Layer
3. Backward Pass
3.1 Activation Backward: cudnnActivationBackward(…);
3.2 Calculate Gradient: cudnnConvolutionBackwardFilter(…);
3.3 Error Backpropagation: cudnnConvolutionBackwardData(…);
38
• The error back-propagated from layer l+1 (d_outputDelta) is multiplied by the derivative of the activation
• See slide 22 (Convolution Layer - Backward)
Convolution Layer
3.1 Activation Backward
39
cudnnActivationBackward(…) takes the layer output d_output and the incoming error d_outputDelta (both described by outputDesc), multiplies the error by the activation derivative, and writes the result back into d_outputDelta.
Convolution Layer
3.2 Calculate Gradient
40
cudnnConvolutionBackwardFilter(…) computes the filter gradient d_filterGradient (described by filterDesc) from the layer input d_input (inputDesc) and the back-propagated error d_outputDelta (outputDesc).
Convolution Layer
3.3 Error Backpropagation
41
cudnnConvolutionBackwardData(…) propagates the error d_outputDelta (described by outputDesc) backward through the filter and writes d_inputDelta (described by inputDesc), which becomes the error for the previous layer.
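And a matching backward-pass sketch for 3.1 to 3.3, again with the v2-era signatures (these backward-filter and backward-data calls take no algorithm or workspace arguments; later cuDNN versions added them). To accumulate into existing gradients, beta would be 1 instead of 0.

const float alpha = 1.0f, beta = 0.0f;

// 3.1 scale the incoming error by the activation derivative, in place on d_outputDelta
cudnnActivationBackward(cudnn_handle, CUDNN_ACTIVATION_RELU, &alpha,
                        outputDesc, d_output,              // y: activation output
                        outputDesc, d_outputDelta,         // dy: error from layer l+1
                        outputDesc, d_output,              // x: passing y is valid for in-place ReLU
                        &beta, outputDesc, d_outputDelta); // dx: written back in place

// 3.2 filter gradient from the layer input and the back-propagated error
cudnnConvolutionBackwardFilter(cudnn_handle, &alpha,
                               inputDesc,  d_input,
                               outputDesc, d_outputDelta,
                               convDesc, &beta,
                               filterDesc, d_filterGradient);

// 3.3 error for the previous layer
cudnnConvolutionBackwardData(cudnn_handle, &alpha,
                             filterDesc, d_filter,
                             outputDesc, d_outputDelta,
                             convDesc, &beta,
                             inputDesc,  d_inputDelta);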
Pooling Layer / Softmax Layer
42
Pooling Layer
1. Initialization: cudnnCreatePoolingDescriptor(&poolingDesc); cudnnSetPooling2dDescriptor(…);
2. Forward Pass: cudnnPoolingForward(…);
3. Backward Pass: cudnnPoolingBackward(…);
Softmax Layer
Forward Pass: cudnnSoftmaxForward(…);
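A brief sketch of the pooling calls with example settings (2x2 max pooling, stride 2, no padding) in the v2-era API; the softmax forward call was sketched earlier.

const float alpha = 1.0f, beta = 0.0f;
cudnnPoolingDescriptor_t poolingDesc;

// 1. initialization: 2x2 max pooling, no padding, stride 2 (example settings)
cudnnCreatePoolingDescriptor(&poolingDesc);
cudnnSetPooling2dDescriptor(poolingDesc, CUDNN_POOLING_MAX, 2, 2, 0, 0, 2, 2);

// 2. forward pass
cudnnPoolingForward(cudnn_handle, poolingDesc, &alpha,
                    inputDesc, d_input, &beta, outputDesc, d_output);

// 3. backward pass: needs both forward tensors plus the incoming error
cudnnPoolingBackward(cudnn_handle, poolingDesc, &alpha,
                     outputDesc, d_output, outputDesc, d_outputDelta,
                     inputDesc,  d_input,  &beta, inputDesc, d_inputDelta);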
3. Optimization Issues
43
Optimization
Learning Very Deep ConvNet
We know that a deep ConvNet can be trained without pre-training, thanks to
• weight sharing
• sparsity
• rectifier units

But "with fixed standard deviations very deep models have difficulties to converge" (Kaiming He et al., 2015)
• e.g. random initialization from a Gaussian distribution with std 0.01
• more than 8 convolution layers
44
Optimization
VGG (K. Simonyan and A. Zisserman, 2015) and GoogLeNet (C. Szegedy et al., 2014)
45
Optimization
Initialization of Weights for Rectifiers (Kaiming He et al., 2015)
• Analyze the variance of the response in each layer
• Derive a sufficient condition that the gradient is not exponentially large/small
• Standard deviation for initialization: std = sqrt( 2 / ( (spatial filter size)^2 x (filter count) ) )
46
Optimization
Case study (3x3 filters)
47

Filter count   std for initialization
64             0.059
128            0.042
256            0.029
512            0.021

• When using 0.01 instead, the std of the gradient propagated back from conv10 to conv1 becomes very small
• Error vanishing
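A small host-side sketch of this initialization rule, matching the table above (3x3 filters with 64 maps give std of about 0.059). The fan-in convention follows the slide, spatial filter size squared times the filter count; a device-side generator such as cuRAND could be used instead.

#include <cmath>
#include <random>

// Fill a weight buffer with zero-mean Gaussian noise of std = sqrt(2 / (k*k*filterCnt)).
// With k = 3 and filterCnt = 64 this gives sqrt(2/576), about 0.059, as in the table.
void heInit(float* weights, int count, int k, int filterCnt)
{
    float sigma = std::sqrt(2.0f / (k * k * filterCnt));
    std::mt19937 gen(0);
    std::normal_distribution<float> dist(0.0f, sigma);
    for (int i = 0; i < count; ++i)
        weights[i] = dist(gen);
}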
Speed
Data loading & model learning
• Reducing data loading and augmentation time by overlapping it with training
• Data provider thread (dp_thread)
• Model learning thread (worker_thread)

readData()
{
    if (is_dp_thread_running) pthread_join(dp_thread);
    …
    if (is_data_remain) pthread_create(dp_thread);
}

readData();
for (…) {
    readData();
    pthread_create(worker_thread);
    …
    pthread_join(worker_thread);
}
48
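A more complete, self-contained sketch of this double-buffering pattern with pthreads; the function bodies, flags and iteration count are illustrative stand-ins for the slide's pseudocode, not VUNO's actual code.

#include <pthread.h>

static pthread_t dp_thread, worker_thread;
static bool is_dp_thread_running = false;

static void* loadAndAugment(void*) { /* read and augment the next mini-batch */ return NULL; }
static void* trainOneBatch(void*)  { /* forward/backward/update on the current batch */ return NULL; }

// Wait for the previous load, hand the loaded batch to the trainer, then start loading
// the next batch so data loading overlaps with model learning.
static void readData(void)
{
    if (is_dp_thread_running) pthread_join(dp_thread, NULL);
    /* swap the "ready" and "in-use" batch buffers here */
    pthread_create(&dp_thread, NULL, loadAndAugment, NULL);
    is_dp_thread_running = true;
}

int main(void)
{
    readData();                           // prefetch the first batch
    for (int iter = 0; iter < 1000; ++iter) {
        readData();                       // start loading the next batch
        pthread_create(&worker_thread, NULL, trainOneBatch, NULL);
        pthread_join(worker_thread, NULL);
    }
    pthread_join(dp_thread, NULL);
    return 0;
}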
Speed
Multi-GPU
• Data parallelization vs. model parallelization
• Model parallelism: distribute the model, use the same data
• Data parallelism: distribute the data, use the same model
• Data parallelization & gradient averaging
• One of the easiest ways to use multiple GPUs
• The result is the same as using a single GPU
49
Parallelism
50
Data Parallelization
• The good: easy to implement
• The bad: the cost of synchronization increases with the number of GPUs
Model Parallelization
• The good: larger networks can be trained
• The bad: synchronization is necessary in all layers
Parallelism
51
Mixing Data Parallelization and Model Parallelization
(Krizhevsky, 2014)
Speed
Data Parallelization & Gradient Average
(Diagram: a single GPU runs forward pass, backward pass, gradient and update on a batch of 256; with data parallelism, two GPUs each run forward pass, backward pass and gradient on 128 samples, the two gradients are averaged, and then the update is applied.)
52
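A minimal single-process sketch of the averaging step for two GPUs: after each GPU has produced its own gradient for a layer, GPU 0 pulls the peer copy and averages it in, which with the combined batch of 256 matches the single-GPU update. The kernel and variable names are illustrative; a production implementation would typically use peer-to-peer access or a collective library.

#include <cuda_runtime.h>

// Element-wise average of two per-GPU gradients: g0 = (g0 + g1) / 2.
__global__ void averageGradients(float* g0, const float* g1, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) g0[i] = 0.5f * (g0[i] + g1[i]);
}

void averageAcrossTwoGpus(float* d_grad_gpu0, const float* d_grad_gpu1, float* d_scratch, int n)
{
    cudaSetDevice(0);
    // Copy GPU 1's gradient into a scratch buffer on GPU 0 (staged through the host if P2P is unavailable).
    cudaMemcpyPeer(d_scratch, 0, d_grad_gpu1, 1, n * sizeof(float));
    averageGradients<<<(n + 255) / 256, 256>>>(d_grad_gpu0, d_scratch, n);
    cudaDeviceSynchronize();
    // GPU 0 then applies the weight update and broadcasts the new weights back to GPU 1.
}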
4. Introduction to VUNO-Net
53
The Team
54
VUNO-Net
(Diagram: VUNO-Net shown alongside other frameworks: Theano, Caffe, Cuda-ConvNet, Pylearn, RNNLIB, DL4J, Torch.)
55
VUNO-Net
56
Features
• Structure: Convolution, LSTM, MD-LSTM (2D), Pooling, Spatial Pyramid Pooling, Fully Connection, Concatenation
• Output: Softmax, Regression, Connectionist Temporal Classification
• Learning: Multi-GPU Support, Batch Normalization, Parametric Rectifier, Initialization for Rectifier, Stochastic Gradient Descent, Dropout, Data Augmentation
(In the original slide each feature is marked as either available in open-source frameworks or VUNO-Net only.)
Performance
State-of-the-art performance on image & speech benchmarks
• Speech Recognition (TIMIT*): VUNO 84.3% vs. 82.2% (WR1)
• Image Classification (CIFAR-100): VUNO 82.1% vs. 79.3% (WR2, Kaggle Top 1, 2014)
* TIMIT is one of the most popular benchmark datasets for speech recognition (Texas Instruments - MIT)
WR1 (World Record) - "Speech Recognition with Deep Recurrent Neural Networks", Alex Graves, ICASSP (2013)
WR2 (World Record) - Kaggle competition: https://www.kaggle.com/c/cifar-10
57
Application
We’ve achieved record-breaking performance on medical image analysis.
58
Application
Whole Lung Quantification
• Sliding window: pixel-level classification (normal vs. emphysema)
• But context information is more important
• Ongoing work: MD-LSTM, Recurrent CNN
59
Application
Whole Lung Quantification
60
Original Image (CT) / VUNO / Gold Standard
Example #1
Application
Whole Lung Quantification
61
Original Image (CT) / VUNO / Gold Standard
Example #2
Visualization
SVM Features (22 features)
62
Histogram, Gradient, Run-length, Co-occurrence matrix, Cluster analysis, Top-hat transformation
Visualization
Activation of top hidden layer
63
200 hidden nodes
Visualization
64
We Are Hiring!!
65
Algorithm Engineer, CUDA Programmer, Application Developer
Business Developer, Staff Member
Q&A
66
Thank You
67
