3. Outline
Quick review/overview
What is Machine Learning?
What is Deep Learning?
Hands-on Tutorial of Deep Learning
Training DL Models: A multi-layer perceptron example & training tips
Deep learning with images: Convolutional Neural Network
5. Machine Learning (ML) and Artificial Intelligence (AI)
AI is the process (or goal) of simulating biological intelligence with computers
It is judged by the outcome: does the system exhibit human intelligence?
A system with a very detailed set of hand-written rules can also be AI
Weak AI ~= expert system ~= a well-trained dog
Machine learning is one way to achieve AI
Design algorithms and problem-specific features, and learn the "rules" of biological intelligence from large amounts of data
[Venn diagram: Machine Learning is a subset of Artificial Intelligence]
6. Machine Learning in One Sentence
"Field of study that gives computers the ability to learn without being explicitly programmed."
- Arthur Lee Samuel, 1959
7. Goal of Machine Learning
For a specific task, find the best function to complete it
Task: the "Who's That Pokémon?" quiz at the end of each episode
f*( silhouette ) = Ans: Bulbasaur
f*( silhouette ) = Ans: Mewtwo
9. Step 1. Define a Set of Functions
[Pipeline: Define a set of functions → Evaluate and Search → Pick the best function]
A set of functions, f(·): {f(θ1), …, f(θ*), …, f(θn)}
Example: the Pokémon trainers in Beitou Park
10. Step 2. Evaluate and Search
[Pipeline: Define a set of functions → Evaluate and Search → Pick the best function]
f1 = f(θ1); evaluate f1 on the task
Based on the results, adjust θ: e.g. stop picking trainers who wear a yellow shirt and carry a Pikachu (move away from θ1)
11. Step 3. Pick the Best Function
[Pipeline: Define a set of functions → Evaluate and Search → Pick the best function]
Result: the Pokémon training master is found
14. Deep Learning vs. Machine Learning
Deep learning is a subset of machine learning
[Venn diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
Artificial Intelligence (AI) – make computers / machines exhibit human intelligence
Machine Learning (ML) – a way to achieve AI, based on:
1) large amounts of data
2) hand-crafted features, e.g. Canny filter, Sobel filter
3) pre-designed algorithms, e.g. logistic regression, decision tree, random forest, support vector machine
Deep Learning (DL) – networks built by mimicking the neural connections of the human brain
1) can take raw data directly as input
2) through multiple layers of connections, the network finds the relation between inputs and outputs by itself during training
16. Deep Learning vs. Machine Learning
Deep learning is a subset of machine learning (same diagram as slide 14)
The essence of deep learning: a mapping from A to B
- Andrew Ng
17. Applications of Deep Learning
Each application is a function f* mapping an input to an output:
Handwritten Recognition: f*( handwritten image ) = "2"
Speech Recognition: f*( audio ) = "Morning"
Playing Go: f*( board position ) = "5-5" (next step)
Dialogue System: f*( "Hello" – what the user said ) = "Hi" (system response)
(Slide Credit: Hung-Yi Lee)
20. How a Neuron Propagates
Example
A neuron computes z = w1*x1 + w2*x2 + … + wn*xn + b, then applies the activation function σ(z)
Here σ(z) = z (a linear activation function)
Inputs: x1 = 5, x2 = 2; weights: w1 = -1, w2 = 3; bias: b = 3
ŷ = σ(z) = z = (-1)*5 + 3*2 + 3 = 4
θ: weights and bias
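A minimal sketch of this single-neuron forward pass in plain NumPy (the values are taken from the example above; the linear activation simply returns z):

import numpy as np

# single-neuron forward pass with a linear activation, using the example values above
x = np.array([5.0, 2.0])      # inputs x1, x2
w = np.array([-1.0, 3.0])     # weights w1, w2
b = 3.0                       # bias
z = np.dot(w, x) + b          # z = w1*x1 + w2*x2 + b
y_hat = z                     # sigma(z) = z (linear activation)
print(y_hat)                  # 4.0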
21. Fully Connected Neural Network
Many neurons connected into a network
Universality theorem: a network with enough neurons can represent any function
[Diagram: inputs x1 … xn fully connected to the next layer through weights w1,1 … w2,n]
22. Fully Connected Neural Network - A simple network with linear activation functions
[Diagram: inputs 5 and 2 are propagated forward through two layers of weights and biases, producing the outputs 0.12 and 0.55]
Forward propagation → Prediction
23. How to Evaluate the Model?
The closer the output values are to the actual values, the better
A loss function quantifies the gap between the network outputs and the actual values
The loss function is a function of θ
[Diagram: inputs x1, x2 → network f(x, θ) → outputs ŷ1, ŷ2, compared with the actual values y1, y2 to compute the loss L(θ)]
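As a rough illustration (not the deck's code), a mean-squared-error loss over the two outputs could be written as follows; the "actual" values here are made up:

import numpy as np

# mean squared error between network outputs and actual values (illustrative only)
def mse_loss(y_hat, y):
    return np.mean((y_hat - y) ** 2)

y_hat = np.array([0.12, 0.55])   # network outputs from the diagram above
y = np.array([0.0, 1.0])         # hypothetical actual values
print(mse_loss(y_hat, y))        # the loss L(theta) for this sample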
24. Fully Connected Neural Network - A simple network with linear activation functions
[Same diagram as slide 22, now with the predictions (0.12, 0.55) compared against the actual values (y1, y2) to compute the loss L(θ)]
Forward propagation → Prediction vs. Actual → Loss
25. Goal: Adjust θ to Minimize the Total Loss
Find the best function that minimizes the total loss
Find the best network weights θ*:
\( \theta^{*} = \arg\min_{\theta} L(\theta) \)
The key question: how do we find θ*?
Brute force (enumerate all possible values)?
Suppose each weight can only take the values 0.0, 0.1, …, 0.9 and there are 500 weights
That is 10^500 combinations in total
Even evaluating 10^6 combinations per second, it would take about 10^486 years
The universe is only about 10^10 years old since the Big Bang
Impossible to enumerate
Supplementary code: numeric_method, BP_framework
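A minimal gradient-descent sketch on a toy one-dimensional loss (this is not the supplementary code; the loss function and learning rate are made up for illustration):

# toy gradient descent: minimize L(theta) = (theta - 3)^2
def loss(theta):
    return (theta - 3.0) ** 2

def gradient(theta):
    return 2.0 * (theta - 3.0)             # dL/dtheta

theta = 0.0                                # initial weight
lr = 0.1                                   # learning rate (eta)
for step in range(50):
    theta = theta - lr * gradient(theta)   # theta <- theta - eta * dL/dtheta
print(theta)                               # close to the minimizer 3.0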
26. Fully Connected Neural Network - A simple network with linear activation functions
[Same diagram as slide 24: forward propagation produces the prediction, which is compared with the actual values to compute the loss L(θ)]
Forward propagation → Prediction vs. Actual → Loss
Back propagation with gradient descent updates the weights
30. Factors That Affect the Gradient
[Diagram: a neuron with inputs x1 … xn, weights w1 … wn, and bias b computes z = w1*x1 + w2*x2 + … + wn*xn + b and outputs ŷ = σ(z), which feeds into the loss L(θ)]
By the chain rule:
\( \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial \theta} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \theta} \)
1. Affected by the loss function
2. Affected by the activation function
Update rule: \( \theta^{1} = \theta^{0} - \eta \left. \frac{\partial L}{\partial \theta} \right|_{\theta = \theta^{0}} \)
3. Affected by the learning rate η
31. Summary – Gradient Descent
Used to optimize a continuous objective function
Move in the direction of improvement
Gradient descent:
The gradient is affected by the loss function
The gradient is affected by the activation function
The step size is affected by the learning rate
33. Drawbacks of Gradient Descent
Question 1
One update per epoch, so convergence is very slow
(one epoch = one pass over all training data)
Can we speed it up?
A solution: stochastic gradient descent (SGD)
Question 2
Gradient-based methods cannot guarantee finding the global optimum
34. Stochastic Gradient Descent
Randomly pick one training sample and update once based on its loss
Another problem: updating sample by sample is also slow
The middle ground - Mini-batch: update once per mini-batch
Benefits of mini-batch
Compared with SGD: faster to complete one epoch
Compared with GD: faster to converge (to the optimum)
[Diagram: the data is split into mini-batches; the loss is computed and the weights are updated once per mini-batch]
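A self-contained sketch of mini-batch SGD on a toy linear model (the data, batch size, and learning rate are made up for illustration):

import numpy as np

# mini-batch SGD on a toy linear model y = w*x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)    # true w = 3

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(10):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        grad = np.mean(2 * (w * xb - yb) * xb)   # dL/dw of the MSE on this mini-batch
        w -= lr * grad                           # one update per mini-batch
print(w)                                         # close to 3.0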
46. Handling Ordinal Data
Ordinal variables
For example: {Low, Medium, High}
Encode them in order: {Low, Medium, High} → {1, 2, 3}
Or create a new feature using the mean or median of each range, e.g.:

UID  Age      →   UID  Age
P1   0-17         P1   15
P2   0-17         P2   15
P3   55+          P3   70
P4   26-35        P4   30
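A small pandas sketch of both encodings (the column names and representative values mirror the tables above; this is illustrative, not the deck's code):

import pandas as pd

# ordinal encoding: map ordered categories to integers
df = pd.DataFrame({"size": ["Low", "High", "Medium", "Low"]})
df["size_encoded"] = df["size"].map({"Low": 1, "Medium": 2, "High": 3})

# age ranges: replace each range with a representative value (e.g. mean or median)
ages = pd.DataFrame({"UID": ["P1", "P2", "P3", "P4"],
                     "Age": ["0-17", "0-17", "55+", "26-35"]})
ages["Age_numeric"] = ages["Age"].map({"0-17": 15, "26-35": 30, "55+": 70})
print(ages)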
72. Alternative: Functional API
The way to go for defining a complex model
For example: multiple outputs, multiple input sources
Why "Functional API"?
All layers and models are callable (like a function call)
Example

from keras.layers import Input, Dense
input = Input(shape=(200,))
output = Dense(10)(input)
73. Example

# Sequential deep learning model
model = Sequential()
model.add(Dense(128, input_dim=200))
model.add(Activation('sigmoid'))
model.add(Dense(256))
model.add(Activation('sigmoid'))
model.add(Dense(5))
model.add(Activation('softmax'))
model.summary()

# Functional API
from keras.layers import Input, Dense
from keras.models import Model
input = Input(shape=(200,))
x = Dense(128, activation='sigmoid')(input)
x = Dense(256, activation='sigmoid')(x)
output = Dense(5, activation='softmax')(x)
# define the Model (function-like)
model = Model(inputs=[input], outputs=[output])
74. Advantages of the Functional API (1)
A model is callable as well, so it is easy to re-use a trained model
This re-uses the architecture and the weights

# if model and input are defined already,
# re-use the same architecture (and weights) of the above model
y1 = model(input)
75. Advantages of the Functional API (2)
Easy to combine multiple input sources
[Diagram: x1 → Dense(100) → y1; y1 is concatenated with x2 into new_x2 → Dense(200) → output]

import keras
from keras.layers import Input, Dense
from keras.models import Model

x1 = Input(shape=(10,))
y1 = Dense(100)(x1)
x2 = Input(shape=(20,))
new_x2 = keras.layers.concatenate([y1, x2])
output = Dense(200)(new_x2)
model = Model(inputs=[x1, x2], outputs=[output])
76. Today
Our exercise uses the "Functional API" model,
because it is more flexible than the Sequential model
Ex00_first_model
80. Tips for Deep Learning
[Flowchart: Good result on the training dataset? If No, adjust the Activation Function, Loss Function, Optimizer, or Learning Rate. If Yes, check: good result on the testing dataset?]
81. Tips for Deep Learning
[Same flowchart as slide 80, highlighting the Loss Function]
The gradient is affected by the loss function:
\( \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial \theta} \)
82. Using MSE
When specifying the loss function
ex01_lossFunc_Selection

# specify the loss function and the optimizer (cross-entropy)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd)

# specify the loss function and the optimizer (mean squared error)
model.compile(loss='mean_squared_error',
              optimizer=sgd)
85. How to Select a Loss Function
Classification: usually cross-entropy,
paired with softmax as the activation function of the output layer
Regression: usually mean absolute / squared error
86. Current Best Model Configuration
Component            Selection
Loss function        categorical_crossentropy
Activation function  sigmoid + softmax
Optimizer            SGD
90. How to Set the Learning Rate
Mostly trial and error; it is usually not larger than 0.1
Tune one order of magnitude at a time:
0.1 → 0.01 → 0.001
then refine: 0.01 → 0.012 → 0.015 → 0.018 → …
or just pick your lucky number!
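A quick sketch of such a coarse sweep in Keras (build_model, x_train, and y_train are placeholder names here, not names from the exercises):

from keras.optimizers import SGD

# try one order of magnitude at a time, then refine around the best value
for lr in [0.1, 0.01, 0.001]:
    model = build_model()                      # hypothetical helper returning an uncompiled model
    model.compile(loss='categorical_crossentropy',
                  optimizer=SGD(lr=lr),
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train, epochs=5,
                        validation_split=0.1, verbose=0)
    print(lr, history.history['val_acc'][-1])  # 'val_accuracy' in newer Keras versions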
91. Tips for Deep Learning
[Same flowchart as slide 80, highlighting the Activation Function]
The gradient is affected by the activation function:
\( \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial \theta} \)
97. Leaky ReLU
Allows a small gradient when the input to the activation function is smaller than 0 (e.g. α = 0.1):
f(x) = x if x > 0, αx otherwise
df/dx = 1 if x > 0, α otherwise
98. Leaky ReLU in Keras
More activation functions:
https://keras.io/layers/advanced-activations/

# for example
from keras.layers.advanced_activations import LeakyReLU
lrelu = LeakyReLU(alpha=0.02)
model.add(Dense(128, input_dim=200))
# specify the activation function
model.add(lrelu)
101. How to Select Activation Functions
Hidden layers
Usually ReLU
Sigmoid is less recommended because of the vanishing gradient problem
Output layer
Regression: linear
Classification: softmax / sigmoid
102. Current Best Model Configuration
Component            Selection
Loss function        categorical_crossentropy
Activation function  relu + softmax
Optimizer            SGD
104. Optimizers in Keras
SGD – Stochastic Gradient Descent
Adagrad – Adaptive Learning Rate
RMSprop – similar to Adagrad
Adam – similar to RMSprop + Momentum
105. Optimizer – SGD
Stochastic gradient descent
Supports momentum, learning rate decay, and Nesterov momentum
Effect of momentum
Without momentum: update = -lr * gradient
With momentum: update = -lr * gradient + m * last_update
Learning rate decay after each update
1/t decay: lr = lr / (1 + decay * t)
t: number of updates done
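A hedged example of configuring these options with Keras's SGD optimizer (the momentum and decay values are placeholders; model is assumed to be defined as in the earlier slides):

from keras.optimizers import SGD

# SGD with momentum, 1/t learning-rate decay, and Nesterov momentum
sgd = SGD(lr=0.01,         # initial learning rate
          momentum=0.9,    # m in: update = -lr*gradient + m*last_update
          decay=1e-4,      # lr = lr / (1 + decay*t) after each update
          nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])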
118. How to Select Optimizers
The usual starting point: Adam
Adaptive learning rate for every weight
Momentum included
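A corresponding sketch with Adam (the learning rate shown is just a common default, not a recommendation from the deck):

from keras.optimizers import Adam

# Adam: adaptive per-weight learning rates, with momentum built in
adam = Adam(lr=0.001)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])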
120. Current Best Model Configuration
Component            Selection
Loss function        categorical_crossentropy
Activation function  relu + softmax
Optimizer            Adam
After 50 epochs: 90% accuracy!
123. Tips for Deep Learning
[Flowchart: good result on the training dataset, but not on the testing dataset → apply Early Stopping, Regularization, Dropout, or Batch Normalization]
What is overfitting?
Training results keep improving, but testing results get worse
Remedies: Early Stopping, Regularization, Dropout, Batch Normalization (see the sketch below)
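As one concrete remedy, early stopping in Keras is set up with a callback; a minimal sketch (the monitored metric and patience are placeholder choices, and x_train / y_train are assumed to exist):

from keras.callbacks import EarlyStopping

# stop training when the validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss',   # metric to watch
                           patience=5)           # epochs to wait before stopping
model.fit(x_train, y_train,
          epochs=100,
          validation_split=0.1,
          callbacks=[early_stop])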
144. How to Set Dropout
Do not add Dropout right from the start
Do not add Dropout right from the start
Do not add Dropout right from the start
a) Dropout makes training performance worse
b) Dropout is for preventing overfitting; it is not a cure-all
c) When there are few parameters, regularization …
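If Dropout is warranted, adding it in Keras looks like this (the 0.5 rate is just a common choice, not the deck's setting):

from keras.layers import Dense, Activation, Dropout

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))   # randomly drop 50% of this layer's outputs during training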
147. A Quick Recap
As mentioned earlier, it is recommended to re-scale the input values
so that the weight-update path slides down a more circular, bowl-shaped loss surface
[Diagram: loss contours over w1 and w2, before and after re-scaling the inputs]
But this only covers the inputs. If the network is deep, the outputs of the intermediate layers can become uncontrolled (due to the nonlinear functions inside the network)
Re-scaling only the input layer is not enough
Have you tried re-scaling every layer?
148. Batch Normalization
Normalize each input feature independently
Use batch statistics for normalization instead of the whole dataset
The same sample will therefore look slightly different in different batches (a kind of data augmentation)
151. BN in Keras

from keras.layers import BatchNormalization
model = Sequential()
model.add(Dense(128, input_dim=200))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(256))
model.add(BatchNormalization())
model.add(Activation('relu'))
# add the output layer (5 neurons)
model.add(Dense(5))
model.add(Activation('softmax'))
# inspect the model summary
model.summary()

ex09_BatchNorm
152. Tips for Training Your Own DL Model
[Summary flowchart: Good result on the training dataset? If No, tune the Activation Function, Loss Function, Optimizer, or Learning Rate. If Yes, good result on the testing dataset? If No, apply Early Stopping, Regularization, Dropout, or Batch Normalization.]
164. CNN Structure (vs. Feedforward DNN)
[Diagram: input image → (Convolutional Layer & Activation Layer → new image → Pooling Layer → new image), which can be performed many times → Flatten → fully connected layers → output, e.g. "PIKACHU!!!"]
165. Convolution in Computer Vision (CV)
Each output pixel is a weighted sum of the corresponding input pixel and its local neighbors, where the weights come from a filter (kernel)
Apply this convolution process to every pixel
[Figure: an image convolved with a filter]
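A bare-bones 2D convolution (strictly, cross-correlation, as in most deep learning frameworks) in NumPy, with no padding and stride 1, just to make the weighted-sum idea concrete; the image and filter below are made up:

import numpy as np

def conv2d(image, kernel):
    # valid cross-correlation: each output pixel is the weighted sum of a local patch
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)                    # toy 5x5 image
kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)  # a simple edge-like filter
print(conv2d(image, kernel).shape)                                  # (3, 3)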
182. The Shape After Convolution
P = padding size, S = stride
Formula for the new height or width:
\( \text{new\_height} = \frac{\text{input\_height} - \text{filter\_height} + 2P}{S} + 1 \)
\( \text{new\_width} = \frac{\text{input\_width} - \text{filter\_width} + 2P}{S} + 1 \)
Total number of weights needed from \( L_n \) to \( L_{n+1} \), with k filters:
\( W_f \times H_f \times D_n \times k + k \)
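A small sketch that evaluates these formulas; the numbers fed in below are the ones used in the quiz on the next slide:

# output spatial size and weight count for a convolutional layer
def conv_output_size(input_size, filter_size, padding, stride):
    return (input_size - filter_size + 2 * padding) // stride + 1

def conv_num_weights(filter_h, filter_w, input_depth, num_filters):
    return filter_h * filter_w * input_depth * num_filters + num_filters   # + biases

# quiz values: 7x7x3 input, 20 filters of 3x3x3, stride 2, no padding
h = conv_output_size(7, 3, padding=0, stride=2)   # 3
w = conv_output_size(7, 3, padding=0, stride=2)   # 3
print((h, w, 20))                                 # (3, 3, 20)
print(conv_num_weights(3, 3, 3, 20))              # 560, including the 20 biases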
183. Quiz
Given the following parameters:
input shape = 7 x 7 x 3 (H x W x D)
20 filters of shape = 3 x 3 x 3 (H x W x D)
stride = 2
no padding
What's the shape of the output?
Ans: 3 x 3 x 20  (since (7 - 3 + 0) / 2 + 1 = 3)
187. Pooling Layer
Why do we need pooling layers?
Reduce the number of weights
Prevent overfitting
Max pooling
Considers only whether a pattern exists in each region

1 2 2 0
1 2 3 2        2 3
3 1 3 2   →    3 3
0 2 0 2
(2x2 max pooling with stride 2)

*How about average pooling?
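A short NumPy sketch of 2x2 max pooling with stride 2, using the matrix from the slide:

import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2 (assumes even height and width)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 2, 0],
              [1, 2, 3, 2],
              [3, 1, 3, 2],
              [0, 2, 0, 2]])
print(max_pool_2x2(x))
# [[2 3]
#  [3 3]]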
200. How to Load the CIFAR-10 Dataset with Keras
This loading function is provided by Keras

# this function is provided by the official package
from keras.datasets import cifar10
# read train/test data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
# check the data shape
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("number of training samples:", x_train.shape[0])
print("number of testing samples:", x_test.shape[0])
201. Show the images
import matplotlib.pyplot as plt
%matplotlib inline
# show the first image of training data
plt.imshow(x_train[0])
# show the first image of testing data
plt.imshow(x_test[0])
203. Building Your Own CNN Model

'''CNN model'''
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Activation, Dropout, Flatten, Dense

model = Sequential()
# CNN part: 32 3x3 filters
# padding='same': perform (zero) padding; 'valid' means no padding
model.add(
    Convolution2D(32, (3, 3), padding='same',
                  input_shape=x_train[0].shape)
)
model.add(Activation('relu'))
model.add(Convolution2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
# DNN part: fully connected layers
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))
204. Model Compilation

'''setting optimizer'''
from keras.optimizers import Adam
# define the learning rate
learning_rate = 0.00017
# define the optimizer
optimizer = Adam(lr=learning_rate)
# let's compile
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
205. Let’s Try CNN
ex0_CNN_practice.ipynb
Training time: CPU vs. GPU
Performance: CNN vs. DNN
Design your own CNN model (optional)
Figure reference
https://unsplash.com/collections/186797/coding
221. Let's Try Data Augmentation
ex1_CNN_data_augmentation.ipynb
Performance: data augmentation vs. no data augmentation
Try different data augmentation methods (a sketch follows below)
Figure reference
https://unsplash.com/collections/186797/coding
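One common way to do this in Keras is ImageDataGenerator; a minimal sketch (the transform ranges and batch size are placeholder values, not necessarily the notebook's settings):

from keras.preprocessing.image import ImageDataGenerator

# randomly transform training images on the fly
datagen = ImageDataGenerator(rotation_range=15,       # random rotations up to 15 degrees
                             width_shift_range=0.1,   # random horizontal shifts
                             height_shift_range=0.1,  # random vertical shifts
                             horizontal_flip=True)    # random horizontal flips
datagen.fit(x_train)

# train using augmented batches
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                    steps_per_epoch=len(x_train) // 32,
                    epochs=20)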
223. Introduction to Transfer Learning
"Transfer": use the knowledge learned from task A to tackle another task B
Example: a sheep / alpaca classifier
[Figure: images of sheep, alpacas, and other animals]
224. Use as a Fixed Feature Extractor
Take a known model, like VGG, trained on ImageNet
ImageNet: about 10 million labeled images
[Diagram: Input → pretrained network → Output; take the output of some layer as the feature vectors]
Train a classifier based on the features extracted by the known model
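A hedged sketch of using VGG16 as a fixed feature extractor (the logistic-regression classifier on top is one possible choice, and x_train is assumed to be a batch of images already resized to 224x224x3; this is not the exercise code):

from keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.linear_model import LogisticRegression

# load VGG16 without its top classification layers, keeping the ImageNet weights
feature_extractor = VGG16(include_top=False, weights='imagenet', pooling='avg')

# take the pooled output of the last convolutional block as feature vectors
features = feature_extractor.predict(preprocess_input(x_train.astype('float32')))

# train a simple classifier on the extracted features
clf = LogisticRegression(max_iter=1000).fit(features, y_train.ravel())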
225. Use as Initialization
Initialize your network with the weights of a known model
Use your own dataset to further train the model
i.e. fine-tune the known model
[Diagram: the VGG model (Input → Output) provides the initial weights for your model (Input → Output)]
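A rough fine-tuning sketch along these lines (freezing the earlier layers and training a new head; the number of frozen layers, head size, and class count are placeholders):

from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

base = VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3))

# freeze the earlier convolutional layers; fine-tune only the last few
for layer in base.layers[:-4]:
    layer.trainable = False

# add a new classification head for your own task (e.g. 2 classes)
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
output = Dense(2, activation='softmax')(x)

model = Model(inputs=base.input, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=10)   # further train on your own dataset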
231. Example of Transfer Learning in Keras

from keras.applications.vgg16 import VGG16
model = VGG16(
    include_top=True,     # include the top (classification) layers?
    weights='imagenet')   # load the pretrained ImageNet weights

If include_top=False, the input_shape can be different, but it should not be too small.
232. Let’s Try Transfer learning
ex2_CNN_transfer_learning.ipynb
Performance: Pure CNN vs. VGG16
Figure reference
https://unsplash.com/collections/186797/coding
234. Recap – Fundamentals
Fundamentals of deep learning
A neural network = a function
Gradient descent
Stochastic gradient descent
Mini-batch
Guidelines to determine a network structure
235. Recap – Improvement on Training Set
How to improve performance on the training dataset
Activation Function
Loss Function
Optimizer
Learning Rate
236. Recap – Improvement on Testing Set
How to improve performance on the testing dataset
Early Stopping
Regularization
Dropout
237. Recap – CNN
Fundamentals of CNN
Concept of filters
Hyper-parameters
Filter size
Zero-padding
Stride
Depth (total number of filters)
Pooling layers
How to train a CNN in Keras
CIFAR-10 dataset
238. Go Deeper in Deep Learning
"Deep Learning"
Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
http://www.iro.umontreal.ca/~bengioy/dlbook/
Course: Deep Learning Specialization (Coursera)
Andrew Ng
https://www.coursera.org/specializations/deep-learning
Course: Machine Learning and Having It Deep and Structured
http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
(Slide Credit: Hung-Yi Lee)
239. References
Keras documentation – the official Keras site, very detailed
Keras GitHub – you can find examples suitable for your application under example/
YouTube channel – Prof. Hung-Yi Lee, NTU EE
Convolutional Neural Networks for Visual Recognition (cs231n)
For suggestions about the course, feel free to write to:
sean@aiacademy.tw / jim@aiacademy.tw / chuan@aiacadymy.tw
cmchang@iis.sinica.edu.tw and chihfan@iis.sinica.edu.tw
Warren McCulloch, a neurophysiologist, and a young mathematician, Walter Pitts, wrote a paper on how neurons might work. They modeled a simple neural network with electrical circuits.
Why DL is back again:
Back propagation
Data
GPU