❤ Convolutional Neural Network
Presented by Junho Cho (@junhocho)
© Junho Cho, 2016 1
Convolutional?
© Junho Cho, 2016 2
This is a Neural Network (NN)
© Junho Cho, 2016 3
Machine Learning =
Feature Representation + Classifier
such as ...
© Junho Cho, 2016 4
SIFT feature + SVM
© Junho Cho, 2016 5
HoG feature + Random Forest
© Junho Cho, 2016 6
HoG feature + SVM
© Junho Cho, 2016 7
SIFT feature + Random Forest
© Junho Cho, 2016 8
CNN feature + SVM
© Junho Cho, 2016 9
SIFT feature + Neural Network
© Junho Cho, 2016 10
Problem
Lots of feature extractors, hard feature engineering
Which classifier?
Framework is extremely modular
© Junho Cho, 2016 11
Deep Learning
enables learning the representation and the classifier
End-to-End (learned together)
© Junho Cho, 2016 12
To be End-to-End
All parts of the network are differentiable
For Back-propagation
© Junho Cho, 2016 13
Much easier training (relative to the past)
You don't need to extract features yourself.
Just let the Neural Network
learn features by itself!
Including the classifier!
© Junho Cho, 2016 14
Requires less domain knowledge
Applies to various domains!
Speech, Text, Image, Reinforcement Learning
© Junho Cho, 2016 15
DenseCap : localize and caption (Vision + Natural Language)
© Junho Cho, 2016 16
and Better performance ⭐
© Junho Cho, 2016 17
This is a typical Convolutional
Neural Network (CNN)
[LeNet-5, LeCun 1998]
© Junho Cho, 2016 18
A basic CNN is
composed of
1. Convolution (Conv)
2. Pooling (Subsampling)
3. Rectified Linear Unit (ReLU)
4. Fully Connected layers (FC)
© Junho Cho, 2016 19
Basic CNN
[(Conv-ReLU)*n - POOL] * m - (FC-ReLU) * k - loss
that's it! for real
© Junho Cho, 2016 20
© Junho Cho, 2016 21
Will explain these computations later
© Junho Cho, 2016 22
CNN usage?
Mostly on
Images!
© Junho Cho, 2016 23
Used as an image recognizer
• Object Classification (Recognition)
• Object Detection
• Image Captioning
• Visual Q&A
• Even in AlphaGo
© Junho Cho, 2016 24
Where to begin ...
the History!
© Junho Cho, 2016 25
[LeNet-5, LeCun 1998]
© Junho Cho, 2016 26
Shallow CNNs existed,
but deep CNNs weren't popular:
1. Computationally hard at the time
2. Vanishing gradient problem: can't train deep nets
© Junho Cho, 2016 27
We wanna go
deeeeeeeper
© Junho Cho, 2016 28
Deep Learning
This is now solved thanks to several
advances:
1. Lots of data (ImageNet)
2. Powerful computation (GPU)
3. Some practical techniques (ReLU, Dropout)
© Junho Cho, 2016 29
What is ImageNet?
© Junho Cho, 2016 30
http://image-net.org
ImageNet is an image database containing 14,197,122 images
with labels.
ILSVRC : challenge of Classification/Localization/Detection
© Junho Cho, 2016 31
ImageNet Top-5 Classification
Error
© Junho Cho, 2016 32
Go deeeeeeeper
© Junho Cho, 2016 33
Where do we use it
again?
© Junho Cho, 2016 34
© Junho Cho, 2016 35
© Junho Cho, 2016 36
© Junho Cho, 2016 37
© Junho Cho, 2016 38
© Junho Cho, 2016 39
© Junho Cho, 2016 40
© Junho Cho, 2016 41
© Junho Cho, 2016 42
Neural Art
video link
© Junho Cho, 2016 43
Now let's understand the computation
Conv, ReLU, Pool, FC
© Junho Cho, 2016 44
Reminder: Perceptron
© Junho Cho, 2016 45
Fully Connected (FC) layers
Densely connected.
Computes over all input neurons; spatial information disappears.
© Junho Cho, 2016 46
FC is matrix multiplication.
output = tf.matmul(input, W)
input size: n
output size: m
then W has n × m parameters
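A minimal sketch of an FC layer as a matmul (the sizes 784 and 256 below are arbitrary examples, and the bias term is omitted for brevity):

import tensorflow as tf

n, m = 784, 256                                            # input size n, output size m
inp = tf.placeholder(tf.float32, [None, n])                # a batch of flattened inputs
W = tf.Variable(tf.truncated_normal([n, m], stddev=0.01))  # W holds n * m parameters
out = tf.matmul(inp, W)                                    # output: [batch, m]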
© Junho Cho, 2016 47
Convolution
It keeps spatial information using convolutional filters
© Junho Cho, 2016 48
Reminder: 1D Convolution
© Junho Cho, 2016 49
But We do 2D
convolution on
images
© Junho Cho, 2016 50
© Junho Cho, 2016 51
© Junho Cho, 2016 52
© Junho Cho, 2016 53
© Junho Cho, 2016 54
© Junho Cho, 2016 55
© Junho Cho, 2016 56
© Junho Cho, 2016 57
© Junho Cho, 2016 58
© Junho Cho, 2016 59
Basically, we train these Conv filters
via Back-propagation
© Junho Cho, 2016 60
More hyperparameters in Conv
1. stride: step size of the Conv filter
2. padding: add borders (usually zeros) to the input (output-size formula below)
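The output spatial size follows the standard formula out = (in + 2*padding - kernel_size)/stride + 1; a small helper sketch (the 7x7 input below is just an illustrative value, since the quiz inputs appear only as figures):

def conv_output_size(in_size, kernel_size, stride, padding):
    # standard Conv output-size formula (integer division)
    return (in_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(7, 3, 2, 0))  # 3
print(conv_output_size(7, 3, 2, 1))  # 4
print(conv_output_size(7, 3, 1, 1))  # 7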
© Junho Cho, 2016 61
Summary of Conv
• Slide the Conv filter over the input.
• Maintain spatial info with shared filter weights
• Parameters: kernel_size, filterNum, padding, stride
• Learnable parameters: the filter weights (kernel_size × kernel_size × in_channels × filterNum)
© Junho Cho, 2016 62
In TensorFlow
tf.nn.conv2d(input, filter,
strides, padding,
use_cudnn_on_gpu=None,
data_format=None,
name=None)
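A minimal usage sketch of the call above (the shapes are arbitrary examples):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])                 # [batch, H, W, in_channels]
f = tf.Variable(tf.truncated_normal([3, 3, 1, 32], stddev=0.01))  # [kH, kW, in, out]
y = tf.nn.conv2d(x, f, strides=[1, 1, 1, 1], padding='SAME')      # y: [batch, 28, 28, 32]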
© Junho Cho, 2016 63
© Junho Cho, 2016 64
© Junho Cho, 2016 65
Quiz
© Junho Cho, 2016 66
© Junho Cho, 2016 67
• kernel_size = 3
• Stride = 2
• padding = 0
© Junho Cho, 2016 68
© Junho Cho, 2016 69
• kernel_size = 3
• Stride = 2
• padding = 1
© Junho Cho, 2016 70
© Junho Cho, 2016 71
• kernel_size = 3
• Stride = 1
• padding = 1
© Junho Cho, 2016 72
© Junho Cho, 2016 73
© Junho Cho, 2016 74
© Junho Cho, 2016 75
© Junho Cho, 2016 76
Compare FC and Conv
Local Invariance
figure credit to CS231n
© Junho Cho, 2016 77
CNN is powerful because
• Local invariance
Since the convolution filters are ‘sliding’ over the input image, the
exact location of the object we want to find does not matter
much.
• Fewer parameters
This helps prevent overfitting.
© Junho Cho, 2016 78
Basically, Conv is a special case of FC.
• Doubly block circulant matrix
• Toeplitz matrix
Conv can be implemented with matrix multiplication (see the 1D sketch below)
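A small NumPy illustration of this idea in 1D (hypothetical toy values; each row of T is the kernel shifted by one position):

import numpy as np

x = np.array([1., 2., 3., 4.])
k = np.array([1., -1.])                       # 1D kernel ('valid' sliding)
T = np.array([[1., -1., 0., 0.],
              [0., 1., -1., 0.],
              [0., 0., 1., -1.]])             # Toeplitz-style matrix built from k
print(T @ x)                                  # [-1. -1. -1.]
print(np.convolve(x, k[::-1], mode='valid'))  # same result via convolution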
© Junho Cho, 2016 79
We apply ReLU to add non-linearity to the model.
© Junho Cho, 2016 80
We need an activation function after Conv
© Junho Cho, 2016 81
Rectified Linear Unit
This is an activation function, replacing sigmoid
© Junho Cho, 2016 82
A single Perceptron
cannot solve XOR.
1. Thus bend the dimension (add non-linearity!)
2. Multi-Layer Perceptron
© Junho Cho, 2016 83
Activation functions are like
switches:
they turn each neuron on or off
and add non-linearity to the Neural Network.
Let's test it!
© Junho Cho, 2016 84
Sigmoid has the vanishing gradient problem.
ReLU is practically the best activation function in CNNs
© Junho Cho, 2016 85
But sigmoid and tanh are also occasionally used.
There are also PReLU and LeakyReLU (sketched below)
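Sketches of these activations in TensorFlow (the 0.2 slope for LeakyReLU is an arbitrary example; PReLU learns the slope instead):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None])
relu = tf.nn.relu(x)                 # max(0, x)
leaky_relu = tf.maximum(0.2 * x, x)  # keeps a small gradient for x < 0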
© Junho Cho, 2016 86
Pooling
• makes the representations smaller and more manageable
• reduces the number of parameters and helps prevent overfitting
• operates over each activation map independently
• Average, L2-norm, Max-pooling
© Junho Cho, 2016 87
Max-pooling
Normally use max-pooling because it generally performs better
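A minimal 2x2 max-pooling sketch, in the same style as the model code later (the input shape is an arbitrary example):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 32])
p = tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
# p: [batch, 14, 14, 32] -- spatial size halved, channel count unchanged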
© Junho Cho, 2016 88
However, recent models replace pooling with strided Conv
© Junho Cho, 2016 89
Dropout
While training, turn off neurons randomly.
This regularizes the model.
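A minimal dropout sketch (the keep probability is fed at run time: e.g. 0.5 during training and 1.0 at test time, as in the MNIST example later):

import tensorflow as tf

h = tf.placeholder(tf.float32, [None, 625])
p_keep = tf.placeholder(tf.float32)
h_drop = tf.nn.dropout(h, p_keep)  # randomly zeroes neurons, rescales the survivors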
© Junho Cho, 2016 90
How do they look in code?
def model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden):
l1a = tf.nn.relu(tf.nn.conv2d(X, w,
strides=[1, 1, 1, 1], padding='SAME'))
# l1a output shape=(?, input_height, input_width, number_of_channels_layer1)
l1 = tf.nn.max_pool(l1a, ksize=[1, 2, 2, 1],
strides=[1, 2, 2, 1], padding='SAME')
# l1 output shape=(?, input_height/2, input_width/2, number_of_channels_layer1)
l1 = tf.nn.dropout(l1, p_keep_conv)
l2a = tf.nn.relu(tf.nn.conv2d(l1, w2,
strides=[1, 1, 1, 1], padding='SAME'))
# l2a output shape=(?, input_height/2, input_width/2, number_of_channels_layer2)
l2 = tf.nn.max_pool(l2a, ksize=[1, 2, 2, 1],
strides=[1, 2, 2, 1], padding='SAME')
# l2 shape=(?, input_height/4, input_width/4, number_of_channels_layer2)
l2 = tf.nn.dropout(l2, p_keep_conv)
l3a = tf.nn.relu(tf.nn.conv2d(l2, w3,
strides=[1, 1, 1, 1], padding='SAME'))
# l3a shape=(?, input_height/4, input_width/4, number_of_channels_layer3)
l3 = tf.nn.max_pool(l3a, ksize=[1, 2, 2, 1],
strides=[1, 2, 2, 1], padding='SAME')
# l3 shape=(?, input_height/8, input_width/8, number_of_channels_layer3)
l3 = tf.reshape(l3, [-1, w4.get_shape().as_list()[0]])
# flatten to (?, input_height/8 * input_width/8 * number_of_channels_layer3)
l3 = tf.nn.dropout(l3, p_keep_conv)
l4 = tf.nn.relu(tf.matmul(l3, w4))
#fully connected_layer
l4 = tf.nn.dropout(l4, p_keep_hidden)
pyx = tf.matmul(l4, w_o)
return pyx
© Junho Cho, 2016 91
Let's analyze famous architectures
© Junho Cho, 2016 92
AlexNet
© Junho Cho, 2016 93
© Junho Cho, 2016 94
© Junho Cho, 2016 95
VGGNet
© Junho Cho, 2016 96
© Junho Cho, 2016 97
© Junho Cho, 2016 98
© Junho Cho, 2016 99
Only uses 3x3 Conv layers.
In terms of receptive field,
two 3x3 layers are better than one 5x5,
and three 3x3 layers are better than one 7x7,
because of fewer parameters (see the arithmetic below).
A network that is too deep is hard to train.
Performance: VGG16 > VGG19
Many researchers fine-tuned this net for
their task because of the simplicity of the
net architecture.
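The parameter counts behind the 3x3 claim, for C input and C output channels (biases ignored; C = 64 is an arbitrary example):

C = 64
print(2 * 3 * 3 * C * C)  # two 3x3 layers:   18*C^2 =  73,728 weights
print(5 * 5 * C * C)      # one 5x5 layer:    25*C^2 = 102,400 weights
print(3 * 3 * 3 * C * C)  # three 3x3 layers: 27*C^2 = 110,592 weights
print(7 * 7 * C * C)      # one 7x7 layer:    49*C^2 = 200,704 weights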
© Junho Cho, 2016 100
GoogLeNet
© Junho Cho, 2016 101
Uses the Inception module.
Enhances scale invariance.
© Junho Cho, 2016 102
ResNet
You can actually train super deep networks!
© Junho Cho, 2016 103
© Junho Cho, 2016 104
Skip-connections make the network easier to optimize!
Learn the residual transformation (sketch below).
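A minimal residual-block sketch (plain two-conv version; real ResNet blocks also use batch norm, and a projection when the shapes change):

import tensorflow as tf

def residual_block(x, w1, w2):
    h = tf.nn.relu(tf.nn.conv2d(x, w1, strides=[1, 1, 1, 1], padding='SAME'))
    h = tf.nn.conv2d(h, w2, strides=[1, 1, 1, 1], padding='SAME')
    return tf.nn.relu(h + x)  # skip-connection: the conv layers only learn the residual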
© Junho Cho, 2016 105
© Junho Cho, 2016 106
Batch Normalization
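The standard batch-norm transform, sketched for a conv activation (gamma and beta are learned scale/shift parameters; the channel count here is an arbitrary example):

import tensorflow as tf

channels = 32
x = tf.placeholder(tf.float32, [None, 28, 28, channels])
gamma = tf.Variable(tf.ones([channels]))
beta = tf.Variable(tf.zeros([channels]))
mean, var = tf.nn.moments(x, axes=[0, 1, 2])         # per-channel batch statistics
y = gamma * (x - mean) / tf.sqrt(var + 1e-5) + beta  # normalize, then scale and shift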
© Junho Cho, 2016 107
© Junho Cho, 2016 108
To train your Neural Network.
Keep in mind
1. Data-preparation : preprocessing, input & output
2. Architecture : dimension check
3. Loss function : CrossEntropy? L2?
4. Optimizer : SGD, Adam, ...
© Junho Cho, 2016 109
Optimizer
© Junho Cho, 2016 110
© Junho Cho, 2016 111
© Junho Cho, 2016 112
ADAM optimizer
currently rules.
Just use it.
© Junho Cho, 2016 113
Just replace
SGD: tf.train.GradientDescentOptimizer
with ADAM: tf.train.AdamOptimizer
slide from link
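For example, in the MNIST code later the swap is a one-line change (the learning rate here is an arbitrary example value):

train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)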
© Junho Cho, 2016 114
import tensorflow as tf
Let's do practice!
© Junho Cho, 2016 115
Import libraries!
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
© Junho Cho, 2016 116
function for variables
def init_weights(shape):
return tf.Variable(tf.truncated_normal(shape, stddev=0.01))
© Junho Cho, 2016 117
Define model
def model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden):
l1a = tf.nn.relu(tf.nn.conv2d(X, w,
strides=[1, 1, 1, 1], padding='SAME'))
# l1a output shape=(?, input_height, input_width, number_of_channels_layer1)
l1 = tf.nn.max_pool(l1a, ksize=[1, 2, 2, 1],
strides=[1, 2, 2, 1], padding='SAME')
# l1 output shape=(?, input_height/2, input_width/2, number_of_channels_layer1)
l1 = tf.nn.dropout(l1, p_keep_conv)
l2a = tf.nn.relu(tf.nn.conv2d(l1, w2,
strides=[1, 1, 1, 1], padding='SAME'))
# l2a output shape=(?, input_height/2, input_width/2, number_of_channels_layer2)
l2 = tf.nn.max_pool(l2a, ksize=[1, 2, 2, 1],
strides=[1, 2, 2, 1], padding='SAME')
# l2 shape=(?, input_height/4, input_width/4, number_of_channels_layer2)
l2 = tf.nn.dropout(l2, p_keep_conv)
l3a = tf.nn.relu(tf.nn.conv2d(l2, w3,
strides=[1, 1, 1, 1], padding='SAME'))
# l3a shape=(?, input_height/4, input_width/4, number_of_channels_layer3)
l3 = tf.nn.max_pool(l3a, ksize=[1, 2, 2, 1],
strides=[1, 2, 2, 1], padding='SAME')
# l3 shape=(?, input_height/8, input_width/8, number_of_channels_layer3)
l3 = tf.reshape(l3, [-1, w4.get_shape().as_list()[0]])
# flatten to (?, input_height/8 * input_width/8 * number_of_channels_layer3)
l3 = tf.nn.dropout(l3, p_keep_conv)
l4 = tf.nn.relu(tf.matmul(l3, w4))
#fully connected_layer
l4 = tf.nn.dropout(l4, p_keep_hidden)
pyx = tf.matmul(l4, w_o)
return pyx
© Junho Cho, 2016 118
Prepare Data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=False)
X_trn, Y_trn, X_test, Y_test = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels
X_trn = X_trn.reshape(-1, 28, 28, 1) # 28x28x1 input img
X_test = X_test.reshape(-1, 28, 28, 1) # 28x28x1 input img
© Junho Cho, 2016 119
Initialize
w = init_weights([3, 3, 1, 32])
w2 = init_weights([3, 3, 32, 64])
w3 = init_weights([3, 3, 64, 128])
w4 = init_weights([128 * 4 * 4, 625])
w_o = init_weights([625, 10])
X = tf.placeholder(tf.float32, [None, 28, 28, 1])  # input images
Y = tf.placeholder(tf.int64, [None])               # integer class labels
p_keep_conv = tf.placeholder(tf.float32)
p_keep_hidden = tf.placeholder(tf.float32)
py_x = model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden)
© Junho Cho, 2016 120
Define loss function
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=py_x, labels=Y))
Select Optimizer
train_op = tf.train.AdagradOptimizer(learning_rate=0.05).minimize(loss)
© Junho Cho, 2016 121
Then Train!
of course with lots of debugging
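A minimal training-loop sketch under the setup above (epoch count, batch size, and keep probabilities are arbitrary choices; it also records the per-epoch losses plotted on the next slide, evaluated on 1000-sample subsets to keep it cheap):

trn_loss_list, test_loss_list = [], []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        # one pass of mini-batch SGD over the training set
        for start in range(0, len(X_trn), 128):
            end = start + 128
            sess.run(train_op, feed_dict={X: X_trn[start:end], Y: Y_trn[start:end],
                                          p_keep_conv: 0.8, p_keep_hidden: 0.5})
        # track losses with dropout disabled (keep probabilities = 1.0)
        trn_loss_list.append(sess.run(loss, feed_dict={X: X_trn[:1000], Y: Y_trn[:1000],
                                                       p_keep_conv: 1.0, p_keep_hidden: 1.0}))
        test_loss_list.append(sess.run(loss, feed_dict={X: X_test[:1000], Y: Y_test[:1000],
                                                        p_keep_conv: 1.0, p_keep_hidden: 1.0}))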
© Junho Cho, 2016 122
Monitor accuracy of my model
correct = tf.nn.in_top_k(py_x, Y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
Monitor my loss function drop
x = np.arange(50)
plt.plot(x, trn_loss_list)
plt.plot(x, test_loss_list)
plt.title("cross entropy loss")
plt.legend(["train loss", "test_loss"])
plt.xlabel("epoch")
plt.ylabel("cross entropy")
© Junho Cho, 2016 123
Tips for starting training
• Good weight initialization
• Random Gaussian initialization
• Famous initializations: Xavier, He
• Or use a pretrained network
• ImageNet-pretrained VGG16!
© Junho Cho, 2016 124
• Don't ignore data preparation
• You need lots of data
• Most annoying and difficult part
• Preprocessing: normalize the input image
• 0~255 >> -1~1 (sketch below)
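A minimal sketch of that normalization in NumPy (the batch here is dummy data):

import numpy as np

images = np.random.randint(0, 256, size=(4, 28, 28, 1)).astype(np.float32)
images = images / 127.5 - 1.0  # maps 0~255 to -1~1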
© Junho Cho, 2016 125
• Visualize training samples and the loss function
• Don't just pray for results
• Monitoring is necessary
© Junho Cho, 2016 126
• Check gradients & adjust the learning rate
• NaN ??!!
• Gradient explosion: lower the learning rate
• Don't be too creative
• Start from base code/architecture like VGG
• Ensemble
• Gain an extra ~2% performance
© Junho Cho, 2016 127
Other Deep Learning Frameworks?
• Torch
• TensorFlow (Keras, Slim, PrettyTensor, ...)
• Caffe
• Theano (Keras, Lasagne)
• MXnet, CNTK, PaddlePaddle, Chainer, ...
© Junho Cho, 2016 128
ConvNet benchmark
© Junho Cho, 2016 129
Hardware
No CPU, use GPU
Not AMD, use NVIDIA graphics cards
© Junho Cho, 2016 130
© Junho Cho, 2016 131
TitanX
© Junho Cho, 2016 132
Graphics cards
VRAM is the most important factor
TitanX: 12GB, 1080/1070: 8GB, 980/970: 4GB
© Junho Cho, 2016 133
The main bottleneck of training time is actually
data I/O or the CPU.
Nice data-loading code is already implemented.
© Junho Cho, 2016 134
Some more CNN applications!
© Junho Cho, 2016 135
Detection
R-CNN, Fast-RCNN, Faster-RCNN
YOLO, Single Shot Detector
© Junho Cho, 2016 136
Segmentation
Fully Convolutional Network, DeepMask
© Junho Cho, 2016 137
© Junho Cho, 2016 138
Deconvolution
Also known as Up-Convolution / Transposed Convolution
Computation is the same as the back-propagation of Convolution
It increases the spatial dimensions
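A minimal transposed-convolution sketch in TensorFlow (the shapes are arbitrary examples; note the filter shape is [kH, kW, out_channels, in_channels] and output_shape must be given explicitly):

import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 14, 14, 32])
f = tf.Variable(tf.truncated_normal([3, 3, 16, 32], stddev=0.01))
y = tf.nn.conv2d_transpose(x, f, output_shape=[1, 28, 28, 16],
                           strides=[1, 2, 2, 1], padding='SAME')  # 14x14 -> 28x28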
© Junho Cho, 2016 139
ETC
© Junho Cho, 2016 140
© Junho Cho, 2016 141
Class Activation Mapping
© Junho Cho, 2016 142
NeuralArt
© Junho Cho, 2016 143
Adversarial Examples
You can fool a CNN classifier.
© Junho Cho, 2016 144
© Junho Cho, 2016 145
© Junho Cho, 2016 146
Generative Adversarial Network
© Junho Cho, 2016 147
© Junho Cho, 2016 148
© Junho Cho, 2016 149
GAN formulation
The Generator network and the Discriminator network try to fool each
other.
Finally, the Generator produces samples that fool the Discriminator.
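This is the standard minimax objective of the original GAN formulation, where G is the Generator and D the Discriminator:

\min_G \max_D \; \mathbb{E}_{x \sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]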
© Junho Cho, 2016 150
SRGAN : Super Resolution GAN
© Junho Cho, 2016 151