Deep Learning
Rouyun Pan
Outline
• Neural Networks

• Regression and Classification

• Deep Learning 

• Convolutional neural network
The concept of learning in an ML system
• Learning = Improving with experience at some task

• Improve over task T,

• With respect to performance measure, P

• Based on experience, E.
A.I. ⊃ Machine learning (NN, SVM, DT, ...) ⊃ Deep learning (CNN, RNN, LSTM, ...)
Case: Housing Price Prediction
Housing Price Prediction
[Diagram: inputs such as size, number of rooms, zip code, and view feed hidden features (family size, traffic, life quality), which feed the predicted price.]
Basic neural network
[Diagram: inputs x1-x4 (input layer) → hidden layer → output Y' (output layer).]
Basic neural network
Each hidden and output unit computes a weighted sum of its inputs (a minimal sketch follows).
[Same diagram: inputs x1-x4 → hidden layer → output Y'.]
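A minimal sketch of those weighted sums, assuming NumPy and made-up sizes and weights (x1-x4 as a 4-element input, 3 hidden neurons, a sigmoid activation):

import numpy as np

def sigmoid(z):
    # squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical example: 4 inputs (x1..x4), 3 hidden neurons, 1 output
x = np.array([0.5, 1.2, -0.3, 0.8])    # input layer
W1 = np.random.randn(3, 4) * 0.1       # hidden-layer weights
b1 = np.zeros(3)                        # hidden-layer biases
W2 = np.random.randn(1, 3) * 0.1       # output-layer weights
b2 = np.zeros(1)

h = sigmoid(W1 @ x + b1)    # many weighted sums, one per hidden neuron
y_hat = W2 @ h + b2         # output Y'
print(y_hat)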
http://www.asimovinstitute.org/neural-network-zoo/
Learning Strategy
• Supervised learning

• Unsupervised learning

• Reinforcement learning
Supervised learning
• There is a training data set for which we already know the correct output.

• The regression problem: 

Predicting results within a continuous output

• The classification problem: 

Predicting results in a discrete output
Application
Input (X) | Output (Y) | Application | Model
House size | Price | Real estate | Standard NN
Ad types, user info | Click on ad | Online advertising | Standard NN
Image | Object (1, ..., 1000) | Photo tagging | CNN
Audio | Text transcript | Speech recognition | RNN
English | Chinese | Machine translation | RNN
Image, radar info | Position of the cars | Autonomous driving | Customized hybrid
Unsupervised learning
• The data have no target attribute.

• Analyze the data to look for patterns and clusters.
Reinforcement learning
• The agent takes actions in an environment so as to maximize some notion of cumulative reward.
The workflow for supervised learning
• Training phase: Data → Feature Extraction → Train the model (with Labels) → Evaluate the model
• Predicting phase: New data → Feature Extraction → Model → Predict
How to train a model
• Training data set.

• The layers and neurons

• Hypothesis / Activation function

• Cost / Loss Function 

• Optimization algorithm
Linear regression
• Hypothesis: hθ(x) = θ0 + θ1·x
Training dataset
How to choose parameters
*Choose so that is close to y for our training example (x, y)
20
Cost function
The cost function quantifies the gap between the network's outputs and the actual values. A common choice is the mean squared error:
• J(θ0, θ1) = (1 / (2m)) · Σ_i (hθ(x^(i)) - y^(i))^2
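A small sketch of the mean-squared-error cost, using the 1/(2m) convention and made-up numbers:

import numpy as np

def mse_cost(predictions, targets):
    # J = 1/(2m) * sum((h(x) - y)^2)
    m = len(targets)
    return np.sum((predictions - targets) ** 2) / (2 * m)

y_true = np.array([200.0, 330.0, 450.0])   # e.g. house prices
y_pred = np.array([210.0, 300.0, 470.0])   # model outputs
print(mse_cost(y_pred, y_true))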
Cost function (cont.)

Calculate the cost
[Worked examples: the cost J(θ0, θ1) evaluated for several candidate parameter values.]
The plot for cost function
Find the best weights to minimize the loss
[Plots: as the weights move toward the optimum, the loss decreases (e.g., 800 → 360 → 100).]
Optimization algorithm
Gradient Descent:

An iterative optimization algorithm for finding the minimum of a function

• Repeat until convergence: θ_j := θ_j - α · ∂J(θ)/∂θ_j
* one epoch = one pass of all the training examples
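A minimal gradient-descent sketch, assuming a single-feature linear model hθ(x) = θ0 + θ1·x, a made-up toy data set, and an assumed learning rate:

import numpy as np

# toy data: house size (x) vs. price (y); numbers are made up
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.5, 4.1, 6.2, 8.1])

theta0, theta1 = 0.0, 0.0    # parameters of h(x) = theta0 + theta1 * x
alpha = 0.05                  # learning rate
m = len(y)

for epoch in range(1000):    # one epoch = one pass of all training examples
    h = theta0 + theta1 * x                 # predictions
    grad0 = np.sum(h - y) / m               # dJ/dtheta0
    grad1 = np.sum((h - y) * x) / m         # dJ/dtheta1
    theta0 -= alpha * grad0                 # step opposite the gradient
    theta1 -= alpha * grad1

print(theta0, theta1)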
Gradient Descent
[Illustration: where the slope ∂J/∂θ ≥ 0 the update decreases θ; where the slope < 0 it increases θ.]
Learning rate
[Plots: too small a learning rate converges slowly; too large a learning rate overshoots or diverges.]
• Typical values to try: ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
Local minimum
[Plot: a loss surface with a local minimum and the global minimum.]
At a local minimum the gradient ∂J/∂θ = 0, so gradient descent can stop there rather than at the global minimum.
Momentum
Movement = negative of gradient + momentum (a fraction of the previous movement), so the optimizer can keep moving through plateaus and shallow local minima.
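A sketch of the momentum update on a toy 1-D loss J(θ) = (θ - 3)^2; the learning rate and momentum factor here are assumed values:

# momentum update sketch on a 1-D quadratic J(theta) = (theta - 3)^2
alpha, beta = 0.05, 0.9       # learning rate and momentum factor (assumed values)
theta, velocity = 0.0, 0.0

for _ in range(100):
    grad = 2 * (theta - 3)                        # dJ/dtheta
    velocity = beta * velocity - alpha * grad     # movement = momentum + negative of gradient
    theta += velocity                             # keeps moving even where the gradient is small

print(theta)   # approaches the minimum at theta = 3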
Mini-Batch optimization
• Mini-batch optimization has the following advantages:

• Reduces memory usage.

• Helps avoid being trapped in local minima, thanks to the randomness of the mini-batches (see the sketch below).
* Batch size = the number of training examples in one pass
* Iterations = number of passes, each pass using [batch size] examples
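A mini-batch sketch for the same toy linear model; the batch size and learning rate are assumed values:

import numpy as np

# mini-batch gradient descent sketch (same toy linear model as above)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = 2.0 * x + np.random.randn(8) * 0.1

theta0, theta1 = 0.0, 0.0
alpha, batch_size = 0.01, 2       # batch size = examples used in one update

for epoch in range(200):
    order = np.random.permutation(len(x))         # shuffle each epoch
    for start in range(0, len(x), batch_size):    # iterations per epoch = m / batch_size
        idx = order[start:start + batch_size]
        h = theta0 + theta1 * x[idx]
        theta0 -= alpha * np.mean(h - y[idx])
        theta1 -= alpha * np.mean((h - y[idx]) * x[idx])

print(theta0, theta1)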
Back propagation (BP)
[Diagram: inputs x1-x4 → hidden layer → output layer produces predicted Y; the gap between predicted Y and the label Y is propagated backwards to update the weights.]
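A minimal backpropagation sketch for a tiny network (4 inputs, 3 sigmoid hidden units, 1 linear output, squared error); all sizes, data, and hyperparameters are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tiny network: 4 inputs -> 3 hidden (sigmoid) -> 1 output (linear)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))                 # 20 made-up training examples
y = X @ np.array([1.0, -2.0, 0.5, 0.3])      # made-up targets (labels)

W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)) * 0.5, np.zeros(1)
alpha = 0.05

for epoch in range(500):
    # forward pass
    h = sigmoid(X @ W1.T + b1)               # hidden activations
    y_hat = (h @ W2.T + b2).ravel()          # predicted Y
    err = y_hat - y                          # gap to the label Y

    # backward pass: propagate the error and update the weights
    dW2 = (err[:, None] * h).mean(axis=0, keepdims=True)
    db2 = err.mean(keepdims=True)
    dh = err[:, None] * W2                   # error pushed back through W2
    dz1 = dh * h * (1 - h)                   # through the sigmoid
    dW1 = dz1.T @ X / len(X)
    db1 = dz1.mean(axis=0)

    W2 -= alpha * dW2; b2 -= alpha * db2
    W1 -= alpha * dW1; b1 -= alpha * db1

print(np.mean(err ** 2))    # training error after the updates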
Feature scaling

Mean Normalization
• Scale each feature, e.g. x := (x - μ) / (max - min)

• Helps make sure gradient descent is working properly (a minimal sketch follows)
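A small sketch of per-feature mean normalization, assuming the (x - mean) / (max - min) variant and made-up house data:

import numpy as np

def mean_normalize(X):
    # x := (x - mean) / (max - min), applied per feature (column)
    mu = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / spread

X = np.array([[2104.0, 3.0],     # house size, number of rooms (made-up values)
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
print(mean_normalize(X))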
Make sure gradient descent is working properly
[Plot: the cost J(θ) versus the number of iterations.]
Under/Overfitting
Underfitting - high bias | Sweet spot | Overfitting - high variance
[Plots: train error and test error; the test error rises again once the model overfits.]
Avoid Overfitting
• Reduce the number of features

• Add more training data.

• Regularization

• Dropout
Regularization
• Keep all the features, but reduce the magnitude of the parameters (a minimal sketch follows).
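A sketch of L2 regularization added to the MSE cost; lam is an assumed regularization strength, and the bias term is left unpenalized by convention:

import numpy as np

def regularized_cost(theta, predictions, targets, lam):
    # MSE plus an L2 penalty that keeps all features but shrinks the weights
    m = len(targets)
    mse = np.sum((predictions - targets) ** 2) / (2 * m)
    penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)   # bias term usually not penalized
    return mse + penalty

theta = np.array([0.5, 2.0, -1.0])
print(regularized_cost(theta, np.array([1.0, 2.0]), np.array([1.2, 1.8]), lam=0.1))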
Dropout
• Instead of using all neurons, "drop out" some at random

(usually with 0.5 probability; see the sketch below)
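A sketch of (inverted) dropout with an assumed keep probability of 0.5:

import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    # randomly zero out neurons during training; scale so the expected value is unchanged
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(h))   # roughly half of the activations are dropped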
Classification
[Examples of classification problems and their decision boundaries (figures).]
Logistic Regression
• Want 0 ≤ hθ(x) ≤ 1

• Sigmoid Function (Logistic Function):

hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(-z))
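A small sketch of the sigmoid function and the logistic-regression hypothesis it defines:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); output always lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # logistic-regression hypothesis: an estimated probability that y = 1
    return sigmoid(np.dot(theta, x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~0.007, 0.5, ~0.993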
Logistic Regression Cost function
• With the sigmoid hypothesis, the squared-error cost becomes non-convex, so gradient descent can get stuck in local minima; a different cost function is needed.
Cost Function & Gradient Descent
• Cost function - log loss (cross-entropy) for the sigmoid function:

J(θ) = -(1/m) Σ_i [ y^(i) log hθ(x^(i)) + (1 - y^(i)) log(1 - hθ(x^(i))) ]

• Gradient Descent on this cost, as before (a minimal sketch follows)
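A sketch of the log loss (cross-entropy) on made-up predictions; the small epsilon is only there to avoid log(0):

import numpy as np

def cross_entropy(y_hat, y):
    # log loss: -1/m * sum(y*log(h) + (1-y)*log(1-h))
    eps = 1e-12                              # avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
print(cross_entropy(y_hat, y))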
DL Frameworks
https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software
Deep learning (DL)
Why is DL Hot Now?

ImageNet Challenge

GPU Usage for ImageNet

Image Classification Task
Convolutional Neural Network (CNN)
CNN
• Fully connected vs. locally connected neural network

• Share the weights across hidden units
Convolution
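A minimal sketch of a 2-D "valid" convolution (implemented as cross-correlation, as most CNN libraries do); the image and kernel values are made up:

import numpy as np

def conv2d(image, kernel):
    # valid 2-D convolution of a single channel with a single filter
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
print(conv2d(image, edge_kernel))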
Visualization of Modulation
Ref: Visualizing Higher-Layer Features of a Deep Network
AlexNet
• A large, deep convolutional neural network (8 layers) that classifies the images in the training set into 1000 different classes.

• On the test data, it achieved top-1 and top-5 error rates of 39.7% and 18.9%.

Convolutional layers + fully-connected layers:
CONV layers: 5

Fully connected layers: 3

Weights: 61M

MACs: 724M
AlexNet
• Trained the network with 2 GPUs on ImageNet data, which contained
over 1.2 million annotated images from 1000 categories.

• Used ReLU for the nonlinearity functions (Found to decrease training
time as ReLUs are several times faster than the conventional tanh
function).

• Used data augmentation techniques that consisted of image
translations, horizontal reflections, and patch extractions.

• Implemented dropout layers in order to combat the problem of
overfitting to the training data.

• Trained the model using batch stochastic gradient descent, with specific
values for momentum and weight decay.
GPU & Big data
• Trained on two GTX 580 GPUs for five to six days.
Data augmentation
• It consisted of image translations, horizontal reflections,
and patch extractions.
Rectified Linear Unit (ReLU)
ReLU function
• The nonlinearity function found to decrease training time, as ReLUs are several times faster than the conventional tanh function (see the sketch below)
[Plot: ReLU vs. tanh activation curves.]
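A small sketch comparing ReLU with tanh on a few sample inputs:

import numpy as np

def relu(z):
    # max(0, z): cheap to compute and does not saturate for positive inputs
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))        # [0. 0. 0. 1.5 3. ]
print(np.tanh(z))     # for comparison: saturates near -1 and 1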
Pooling
• Reduce the resolution of each channel independently

• Increase translation invariance and noise resilience (a minimal sketch follows)
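A sketch of 2x2 max pooling with stride 2 on one channel; the feature-map values are made up and even height/width is assumed:

import numpy as np

def max_pool_2x2(channel):
    # 2x2 max pooling with stride 2 on a single channel (height and width assumed even)
    h, w = channel.shape
    return channel.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 2]], dtype=float)
print(max_pool_2x2(fmap))   # [[6. 2.] [2. 7.]]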
Local response
normalization (LRN)
• Tries to mimic the inhibition scheme in the brain
Dropout
• Avoids overfitting in the fully connected (FC) layers.
Revolution of Depth
http://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
CNN comparison
Demo
• Tensorflow playground

http://playground.tensorflow.org/

• ConvNetJS CIFAR-10 demo

http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
Resource
• Deep Learning on Coursera, Andrew Ng, Stanford University

https://www.coursera.org/specializations/deep-learning

• Deep Learning on Udacity

https://www.udacity.com/course/deep-learning--ud730

• Machine Learning Foundations, HT Lin, National Taiwan University

https://www.coursera.org/learn/ntumlone-mathematicalfoundations/

• TensorFlow

https://www.tensorflow.org/

• cnn-benchmarks

https://github.com/jcjohnson/cnn-benchmarks

Deep learning