DL (v2).pptx

Deep learning attracts lots of attention
• I believe you have seen lots of exciting results before.
[Figure: Deep learning trends at Google. Source: SIGMOD/Jeff Dean]

Ups and downs of Deep Learning
• 1958: Perceptron (linear model)
• 1969: Perceptron shown to have limitations
• 1980s: Multi-layer perceptron
• Not significantly different from today's DNNs
• 1986: Backpropagation
• Usually, more than 3 hidden layers did not help
• 1989: 1 hidden layer is “good enough”, so why go deep?
• 2006: RBM initialization (breakthrough)
• 2009: GPU
• 2011: Starts to become popular in speech recognition
• 2012: Wins the ILSVRC image competition

Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

Neural Network
[Figure: several “neurons”, each taking a weighted input z and applying an activation function]
Different connections lead to different network structures.
Network parameter 𝜃: all the weights and biases in the “neurons”

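Fully Connected Feedforward Network
Sigmoid Function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Example with input (1, -1):
• First neuron: weights (1, -2), bias 1, so z = 4 and the output is 𝜎(4) = 0.98
• Second neuron: weights (-1, 1), bias 0, so z = -2 and the output is 𝜎(-2) = 0.12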
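
The two-neuron example on this slide (input (1, -1), weights (1, -2) with bias 1, and weights (-1, 1) with bias 0) can be checked with a few lines of Python. This is only a minimal sketch: the helper names `sigmoid` and `neuron` are chosen here for illustration and do not come from the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Return (z, sigmoid(z)) where z is the weighted sum of inputs plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, sigmoid(z)

x = [1.0, -1.0]
z1, a1 = neuron(x, [1.0, -2.0], 1.0)   # z1 = 1*1 + (-2)*(-1) + 1 = 4
z2, a2 = neuron(x, [-1.0, 1.0], 0.0)   # z2 = (-1)*1 + 1*(-1) + 0 = -2
print(round(a1, 2), round(a2, 2))      # prints: 0.98 0.12
```

The rounded outputs agree with the 0.98 and 0.12 shown on the slide.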

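Fully Connected Feedforward Network
[Figure: three layers of two neurons each, with weights and biases annotated on the diagram]
• Layer 1: weights (1, -2) and (-1, 1), biases (1, 0)
• Layer 2: weights (2, -1) and (-2, -1), biases (0, 0)
• Layer 3: weights (3, -1) and (-1, 4), biases (-2, 2)
Input (1, -1) → layer 1 output (0.98, 0.12) → layer 2 output (0.86, 0.11) → network output (0.62, 0.83)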

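Fully Connected Feedforward Network
The same network with input (0, 0):
Input (0, 0) → layer 1 output (0.73, 0.5) → layer 2 output (0.72, 0.12) → network output (0.51, 0.85)
$f\left(\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right) = \begin{bmatrix} 0.62 \\ 0.83 \end{bmatrix}$
$f\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} 0.51 \\ 0.85 \end{bmatrix}$
This is a function: input vector in, output vector out.
Given a network structure, we define a function set.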
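
To make "a network is a function" concrete, here is a hedged sketch of the whole three-layer example as a Python function `f`. The weight and bias values in `LAYERS` are an assumption: they were reconstructed from the numbers on the diagram so that both worked examples (inputs (1, -1) and (0, 0)) are reproduced. The helper names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, a):
    """One fully connected layer: sigmoid(W a + b), with W given as a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, a)) + bias)
            for row, bias in zip(W, b)]

# Assumed parameters (reconstructed from the slide's diagram, see lead-in).
LAYERS = [
    ([[1.0, -2.0], [-1.0, 1.0]], [1.0, 0.0]),
    ([[2.0, -1.0], [-2.0, -1.0]], [0.0, 0.0]),
    ([[3.0, -1.0], [-1.0, 4.0]], [-2.0, 2.0]),
]

def f(x):
    """Feed x through every layer in turn and return the output vector."""
    a = x
    for W, b in LAYERS:
        a = layer(W, b, a)
    return a

print([round(v, 2) for v in f([1.0, -1.0])])  # prints: [0.62, 0.83]
print([round(v, 2) for v in f([0.0, 0.0])])   # prints: [0.51, 0.85]
```

Changing any number in `LAYERS` gives a different function with the same structure, which is what "a network structure defines a function set" means.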

Fully Connected Feedforward Network
[Figure: Input Layer (x_1, x_2, ……, x_N) → Hidden Layers (Layer 1, Layer 2, ……, Layer L) → Output Layer (y_1, y_2, ……, y_M); each node is a neuron]

Deep = Many hidden layers
Network            Layers   Error rate
AlexNet (2012)     8        16.4%
VGG (2014)         19       7.3%
GoogleNet (2014)   22       6.7%
Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf

Deep = Many hidden layers
Network               Layers   Error rate
AlexNet (2012)                 16.4%
VGG (2014)                     7.3%
GoogleNet (2014)               6.7%
Residual Net (2015)   152      3.57%   (special structure)
For comparison: Taipei 101 has 101 “layers”.

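Matrix Operation
$\sigma\left(\begin{bmatrix} 1 & -2 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \sigma\left(\begin{bmatrix} 4 \\ -2 \end{bmatrix}\right) = \begin{bmatrix} 0.98 \\ 0.12 \end{bmatrix}$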

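Neural Network
[Figure: input x (x_1, x_2, ……, x_N) passes through layers with parameters (W^1, b^1), (W^2, b^2), ……, (W^L, b^L), producing a^1, a^2, ……, and the output y (y_1, y_2, ……, y_M)]
$a^1 = \sigma(W^1 x + b^1)$
$a^2 = \sigma(W^2 a^1 + b^2)$
……
$y = \sigma(W^L a^{L-1} + b^L)$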
Neural Network
$y = f(x) = \sigma(W^L \cdots \sigma(W^2 \, \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$
Using parallel computing techniques to speed up matrix operations

Output Layer
[Figure: Input Layer (x_1, x_2, ……, x_K) → Hidden Layers → Output Layer (y_1, y_2, ……, y_M)]
Hidden layers: feature extractor replacing feature engineering
Output layer = multi-class classifier (Softmax)
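
The slide names Softmax as the output layer but does not spell it out. The sketch below uses the standard definition, softmax(z)_i = e^{z_i} / Σ_j e^{z_j}. The input values (3, 1, -3) are an illustrative assumption, not taken from this deck.

```python
import math

def softmax(z):
    """Turn arbitrary scores into positive values that sum to 1."""
    m = max(z)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([3.0, 1.0, -3.0])               # hypothetical scores from the last layer
print([round(v, 2) for v in p])             # prints: [0.88, 0.12, 0.0]
```

Because the outputs are positive and sum to 1, they can be read as the confidence assigned to each class.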

Example Application
Input: a 16 x 16 = 256 pixel image, encoded as x_1, x_2, ……, x_256 (ink → 1, no ink → 0)
Output: y_1, y_2, ……, y_10; each dimension represents the confidence of a digit
• y_1 = 0.1 (is 1)
• y_2 = 0.7 (is 2)
• ……
• y_10 = 0.2 (is 0)
The image is “2”
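
A short sketch of the encoding described on this slide: a 16 x 16 image becomes a 256-dimensional vector of 1s (ink) and 0s (no ink), and the predicted digit is the output dimension with the highest confidence. The function names are hypothetical. The mapping of y_1 to "1", y_2 to "2", ……, y_10 to "0" follows the slide's labels.

```python
def image_to_vector(image):
    """Flatten a 16x16 grid of booleans (True = ink) into 256 values of 1.0 / 0.0."""
    return [1.0 if pixel else 0.0 for row in image for pixel in row]

def predicted_digit(y):
    """Pick the most confident of the 10 outputs; index 0 is y_1 ("is 1"), index 9 is y_10 ("is 0")."""
    best = max(range(len(y)), key=lambda i: y[i])
    return (best + 1) % 10

blank = [[False] * 16 for _ in range(16)]
x = image_to_vector(blank)                  # 256-dim input vector

y = [0.1, 0.7] + [0.0] * 7 + [0.2]          # y_1 = 0.1, y_2 = 0.7, y_10 = 0.2
print(len(x), predicted_digit(y))           # prints: 256 2
```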

Example Application
• Handwriting Digit Recognition
[Figure: image of a handwritten digit → Machine → “2”]
What is needed is a function ……
Input: 256-dim vector (x_1, x_2, ……, x_256)
Output: 10-dim vector (y_1 “is 1”, y_2 “is 2”, ……, y_10 “is 0”)
The function is a Neural Network

Example Application
[Figure: Input Layer (x_1, x_2, ……, x_N) → Hidden Layers (Layer 1, Layer 2, ……, Layer L) → Output Layer (y_1 “is 1”, y_2 “is 2”, ……, y_10 “is 0”) → “2”]
The network defines a function set containing the candidates for Handwriting Digit Recognition.
You need to decide the network structure so that a good function is in your function set.

FAQ
• Q: How many layers? How many neurons for each layer?
• Trial and Error + Intuition
• Q: Can the structure be automatically determined?
• E.g. Evolutionary Artificial Neural Networks
• Q: Can we design the network structure?
• Convolutional Neural Network (CNN)

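Loss for an Example
[Figure: input x_1, x_2, ……, x_256 → network with Softmax output y_1, y_2, ……, y_10, compared by cross entropy with the target ŷ_1, ŷ_2, ……, ŷ_10]
Given a set of parameters, for an image of “1” the target is ŷ = (1, 0, 0, ……).
Cross entropy between the output y and the target ŷ:
$C(y, \hat{y}) = -\sum_{i=1}^{10} \hat{y}_i \ln y_i$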
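
The cross-entropy loss for one example can be sketched as below, assuming the usual reading of the slide in which y is the softmax output and ŷ the one-hot target. The two output vectors are made-up values used only to show that a confident correct prediction has a lower loss.

```python
import math

def cross_entropy(y, y_hat):
    """C(y, y_hat) = -sum_i y_hat_i * ln(y_i); terms with y_hat_i == 0 contribute nothing."""
    return -sum(t * math.log(p) for p, t in zip(y, y_hat) if t != 0.0)

target = [1.0] + [0.0] * 9        # one-hot target for an image of "1"
confident = [0.91] + [0.01] * 9   # hypothetical output: mostly "1"
unsure = [0.1] * 10               # hypothetical output: no preference

print(cross_entropy(confident, target) < cross_entropy(unsure, target))  # prints: True
```

With a one-hot target the sum reduces to -ln of the confidence given to the correct digit.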

Total Loss
[Figure: each training example x^1, x^2, x^3, ……, x^N is fed to the NN; its output y^n is compared with the target ŷ^n, giving a loss C^n]
For all training data …
Total Loss: $L = \sum_{n=1}^{N} C^n$
Find a function in the function set that minimizes total loss L
Find the network parameters 𝜽* that minimize total loss L

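Gradient Descent
For each parameter in 𝜃, compute the partial derivative of L and add -𝜇 times that derivative:
• w_1: compute ∂L/∂w_1, update by -𝜇 ∂L/∂w_1: 0.2 → 0.15
• w_2: compute ∂L/∂w_2, update by -𝜇 ∂L/∂w_2: -0.1 → 0.05
• ……
• b_1: compute ∂L/∂b_1, update by -𝜇 ∂L/∂b_1: 0.3 → 0.2
• ……
The gradient collects all the partial derivatives:
$\nabla L = \begin{bmatrix} \partial L / \partial w_1 \\ \partial L / \partial w_2 \\ \vdots \\ \partial L / \partial b_1 \\ \vdots \end{bmatrix}$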

Gradient Descent
Repeat the update for every parameter in 𝜃:
• w_1: 0.2 → 0.15 → 0.09 → ……
• w_2: -0.1 → 0.05 → 0.15 → ……
• ……
• b_1: 0.3 → 0.2 → 0.10 → ……
• ……
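
The repeated update 𝜃 ← 𝜃 - 𝜇 ∇L can be sketched as follows. The starting values (0.2, -0.1, 0.3) are the ones on the slide. Everything else is an assumption made for illustration: the quadratic `loss` stands in for the real network loss, `target` is an arbitrary minimiser, and `mu=0.1` is an arbitrary step size.

```python
def loss(theta, target):
    """Hypothetical stand-in loss: squared distance from theta to a fixed target."""
    return sum((p - t) ** 2 for p, t in zip(theta, target))

def gradient(theta, target):
    """Partial derivatives of the stand-in loss with respect to each parameter."""
    return [2.0 * (p - t) for p, t in zip(theta, target)]

def gradient_descent(theta, target, mu=0.1, steps=100):
    """Repeat theta <- theta - mu * gradient for a fixed number of steps."""
    for _ in range(steps):
        grad = gradient(theta, target)
        theta = [p - mu * g for p, g in zip(theta, grad)]
    return theta

theta0 = [0.2, -0.1, 0.3]        # initial w_1, w_2, b_1 from the slide
target = [0.0, 0.25, 0.05]       # assumed minimiser of the stand-in loss
theta = gradient_descent(theta0, target)
print(loss(theta, target) < loss(theta0, target))  # prints: True
```

In a real network the loop is the same; only `gradient` changes, and computing it efficiently is the job of backpropagation.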

Gradient Descent
This is the “learning” of machines in deep
learning ……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
[Figure: what people imagine …… versus what it actually is ……]

Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w in a neural network
• libdnn: developed by 周伯威, a student at National Taiwan University
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
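
As a small illustration of the idea behind backpropagation, the sketch below applies the chain rule to a single sigmoid neuron and checks the result against finite differences. It is not the libdnn implementation. The squared-error loss is swapped in here for brevity (the slides use cross entropy), and all names and numeric values are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, target):
    """Squared error of one sigmoid neuron (an assumed loss, chosen for brevity)."""
    a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return (a - target) ** 2

def backprop_gradient(w, b, x, target):
    """Chain rule: dL/dw_i = dL/da * da/dz * dz/dw_i, and dL/db = dL/da * da/dz."""
    a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    dL_dz = 2.0 * (a - target) * a * (1.0 - a)
    return [dL_dz * xi for xi in x], dL_dz

def numerical_gradient(w, b, x, target, eps=1e-6):
    """Central finite differences on each weight, used only as a check."""
    grads = []
    for i in range(len(w)):
        hi = list(w); hi[i] += eps
        lo = list(w); lo[i] -= eps
        grads.append((loss(hi, b, x, target) - loss(lo, b, x, target)) / (2 * eps))
    return grads

w, b, x, target = [1.0, -2.0], 1.0, [1.0, -1.0], 0.0
analytic, _ = backprop_gradient(w, b, x, target)
numeric = numerical_gradient(w, b, x, target)
print(all(abs(g - n) < 1e-6 for g, n in zip(analytic, numeric)))  # prints: True
```

The finite-difference check needs two loss evaluations per weight, while the chain rule gets every derivative from one forward pass. This is why backpropagation is the efficient choice for networks with many parameters.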

Concluding Remarks
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
What are the benefits of deep architecture?

Deeper is Better?
Layers X Size   Word Error Rate (%)   Layers X Size   Word Error Rate (%)
1 X 2k          24.2
2 X 2k          20.4
3 X 2k          18.4
4 X 2k          17.8
5 X 2k          17.2                  1 X 3772        22.5
7 X 2k          17.1                  1 X 4634        22.6
                                      1 X 16k         22.1
Not surprising: more parameters, better performance.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Universality Theorem
Any continuous function $f : R^N \to R^M$ can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Why “Deep” neural network not “Fat” neural network? (next lecture)

“Learning Deep Learning in Depth” (“深度學習深度學習”)
• My Course: Machine learning and having it deep and structured
• http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
• 6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351
• “Neural Networks and Deep Learning”
• written by Michael Nielsen
• http://neuralnetworksanddeeplearning.com/
• “Deep Learning”
• written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
• http://www.deeplearningbook.org
