20200323 - AI Intro

Fundamentals of Machine Learning and Neural Networks
Taka Wang
2020/03/25

Source: Train Object Detection AI with 6 lines of code

import tensorflow as tf

# Load MNIST: 60,000 training and 10,000 test images of 28x28 pixels
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Fully connected network: 512 ReLU units followed by a 10-way softmax
network = tf.keras.models.Sequential()
network.add(tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(tf.keras.layers.Dense(10, activation='softmax'))
network.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Flatten each image to a 784-element vector and scale pixels to [0, 1]
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

# One-hot encode the labels to match the categorical cross-entropy loss
train_labels = tf.keras.utils.to_categorical(train_labels)
test_labels = tf.keras.utils.to_categorical(test_labels)

# Train, then measure loss and accuracy on the held-out test set
network.fit(train_images, train_labels, epochs=5, batch_size=128)
test_loss, test_acc = network.evaluate(test_images, test_labels)

Source: Prowess
Neural Networks

Source: 超智能體：⾼考與機器學習

The only skill that will be important in
the 21st century is the skill of learning
new skills. Everything else will become
obsolete over time.
- Peter Drucker

There’s this very simple concept that Carol Dweck talks
about, which is if you take two people, one of them is a
learn-it-all and the other one is a know-it-all, the learn-it-all
will always trump the know-it-all in the long run.
A NEW CORPORATE MINDSET
Source: 主動打造⼈才

[Diagram: Data → Training → Model; Observation → Learning → Skill; Data → ML → Skill (improved performance measure)]

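Which problems are suitable for machine learning (AI)?
• Predicting whether a baby's next cry will happen at an even-numbered minute
• Deciding whether a graph contains a cycle
• Deciding whether to approve a credit card for certain customers
• Predicting whether the Earth will be destroyed by a nuclear disaster in the next ten years
Source: 林軒田 機器學習基石 (Hsuan-Tien Lin, Machine Learning Foundations)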

Some rules of thumb
• There exists some underlying pattern to be learned, so the performance measure can be improved
• But there is no programmable definition of that pattern
• There is data about the pattern

Types of Machine Learning
Source: Machine Learning for Business

Low Density Region
Source: Anomaly/Novelty detection with scikit-learn

Semi-Supervised Learning
Source: APPLE Photo

Reinforcement Learning
Sequential decision tasks, approximate dynamic programming

Batch Learning vs Online Learning
[Diagram, batch learning: Data → Train ML model → Evaluate solution → Launch!]
[Diagram, online learning: Run and learn from new data (on the fly)]
Forgetting speed: learning rate

• Transfer Learning
• Online Learning
• Reinforcement Learning
• Deep Learning
• Unsupervised Learning
• Semi-supervised Learning
• Boosting Machines

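Model and Cost Function
Linear Regression
Model: $y_{\text{pred}} = mx + b$
Cost functions:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_{\text{pred},i} - y_i \right|$$
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_{\text{pred},i} - y_i \right)^2}$$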
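The two cost functions can be checked on a few numbers. This is a minimal sketch; the data points and the values of `m` and `b` are made up for illustration and are not from the slides.

```python
import numpy as np

def mae(y_pred, y):
    """Mean absolute error: average of |y_pred_i - y_i|."""
    return np.mean(np.abs(y_pred - y))

def rmse(y_pred, y):
    """Root mean squared error: square root of the average squared error."""
    return np.sqrt(np.mean((y_pred - y) ** 2))

# Hypothetical data generated by y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# A deliberately imperfect model y_pred = m*x + b that is off by 1 everywhere
m, b = 2.0, 0.0
y_pred = m * x + b

print(mae(y_pred, y), rmse(y_pred, y))
```

Because every prediction is off by exactly 1, both measures come out equal here; they differ once the errors vary in size, since RMSE weights large errors more heavily.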

Linear Regression
Source: GIPHY
Step size: learning rate
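The animation referenced here presumably shows a line being fitted step by step. One way to sketch that process is plain gradient descent on the mean squared error, where the learning rate sets the size of each step. The function name `fit_line`, the data, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def fit_line(x, y, learning_rate=0.05, steps=2000):
    """Fit y = m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        error = (m * x + b) - y
        # Gradients of (1/n) * sum(error^2) with respect to m and b
        grad_m = 2.0 / n * np.sum(error * x)
        grad_b = 2.0 / n * np.sum(error)
        # Step against the gradient; the learning rate is the step size
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Hypothetical noise-free data on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m_fit, b_fit = fit_line(x, y)
print(m_fit, b_fit)
```

With this data a much larger learning rate (roughly above 0.15) makes the updates overshoot and diverge, while a much smaller one converges slowly.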

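Bad Data
• Not enough good data
• Not representative
  • Too few samples → sampling noise
  • Many samples but a poor sampling method → sampling bias
• Poor data quality
  • Drop attributes, drop samples, fill in values, cross-check...
• Irrelevant attributes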

Bias-Variance Tradeoff
Archery analogy

Overfitting vs Underfitting
Image Source
Regression
Classification

Image Source
Bias-Variance Tradeoff

How to spot it in practice
Training error good, validation error poor → overfitting
Training error poor, validation error poor → underfitting

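Ways to fix underfitting
• Switch to a more complex model
• Train for more iterations
• Tune hyperparameters (change the model architecture)
• Engineer better features to train the model
• Remove the regularization term (if any)
[Figure panels: Depth 1, Depth 2, Depth 3]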

Ways to fix overfitting
• Switch to a simpler model
• Add training data (e.g. data augmentation)
• Reduce the feature dimensionality
• Use regularization
• Dropout
• Weight decay
• Early stopping
• Tune hyperparameters (change the model architecture)
• Mini-batch (covered later)

Dropout Regularization
Per-layer drop probabilities: p[0] = 0.0, p[1] = 0.0, p[2] = 0.5, p[3] = 0.0, p[4] = 0.25
The effect is similar to bagging
Further reading
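A minimal sketch of what a drop probability does to one layer's activations. It uses the common "inverted dropout" scaling, which is an assumption here since the slide does not say which variant it has in mind; the function name `dropout` and the array sizes are also illustrative.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero units with probability drop_prob (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations  # nothing is dropped at inference time
    keep_prob = 1.0 - drop_prob
    # Boolean mask: True where the unit is kept
    mask = rng.random(activations.shape) < keep_prob
    # Divide by keep_prob so the expected activation stays unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 1000))                    # hypothetical layer activations
out = dropout(h, drop_prob=0.5, rng=rng)

print(np.unique(out))                     # surviving units are scaled up, the rest are 0
print(out.mean())                         # stays close to the original mean of 1.0
```

Each training step samples a different mask, so a different sub-network is trained each time; that is where the similarity to bagging comes from.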

Source: US Patent
Dropout for neural networks has been patented
Google: only to protect ourselves

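The dragonfly parable
[Figure panels: Training Samples, Testing Samples]
Ref: ⼀隻蚊⼦告訴你，什麼是正則化 (A mosquito tells you what regularization is)

Attaching a pendant
[Figure panels: Training Samples, Testing Samples]
Ref: ⼀隻蚊⼦告訴你，什麼是正則化 (A mosquito tells you what regularization is)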

Training set and test set
[Diagram: the data split into a training set and a test set]
Testing error: an estimate of the generalization error (out-of-sample error)

Model selection
Generalization Error

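Hyperparameter tuning
Try out many parameter combinations
The generalization error looks great
Yet real-world performance still falls apart
Maybe we over-optimized for the recent past exam papers?!
Time to buy new past papers!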

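Hyperparameter tuning
Past papers from the last ten years → training set
Last year's past paper → test set
The joint entrance exam → unknown data
Mirror this three-tier relationship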

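Hyperparameter tuning
Before: past papers from the last ten years → training set; last year's past paper → test set; the joint entrance exam → unknown data
After: past papers from the last ten years → training set; past papers from the last three years → validation set; last year's past paper → test set; the joint entrance exam → unknown data
With this split, the estimate of the generalization error is more meaningful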

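Problems with a validation set
Validation set too large: too little training data is left, so the result cannot represent what training on the training and validation sets combined would achieve
Validation set too small: the evaluation is very inaccurate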

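n-fold cross-validation
[Diagram: the test set is held out; the remaining data is split into n folds, and each fold takes a turn as the validation set while the others form the training set]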
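The fold rotation can be sketched with index arrays alone. The helper `k_fold_indices` below is a hypothetical name written for this illustration; in practice a library routine such as scikit-learn's `KFold` would normally be used instead.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)        # shuffle once up front
    folds = np.array_split(indices, n_folds)    # n_folds nearly equal chunks
    for i in range(n_folds):
        val_idx = folds[i]                      # fold i is the validation set
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != i]
        )                                       # all other folds are training data
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print("train:", sorted(train_idx.tolist()), "validate:", sorted(val_idx.tolist()))
```

Every sample is used for validation exactly once, and the n validation scores are averaged, which addresses both the "too large" and "too small" validation set problems.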

Data Mismatch Problem
(common in practice)

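Scenario
• We are designing a fruit classifier (the baby monitor problem works too) that we plan to run on phones
• Millions of fruit images are easy to collect from the internet
• We have only ten thousand phone-shot, manually labeled images
• We carefully split the manually labeled data into a validation set and a test set, and train on all of the crawled images. The generalization error turns out to be poor. What is going on?
• Blame it on overfitting again?!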

[Diagram: data crawled from the web → training set and train-dev set; phone-shot data → validation set and test set]

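[Diagram: training set, train-dev set, validation set, test set]
Make sure the validation set and the test set come from the same distribution as the data you plan to run inference on
Use the train-dev set to detect overfitting
Use the validation set to detect data mismatch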

The inspiration for neural networks
[Figure: a neuron and the signal it carries]

Biological fact: All or Nothing
[Figure: a dial with a threshold; above the threshold there is output, below it there is no output]

Simplified model
Inputs a, b, c
Sum of inputs: $x = a + b + c$
Activation function: $H(x)$
Output: $y = H(x)$

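Perceptron
Inputs $x_1, x_2, \dots, x_d$ with weights $w_1, w_2, \dots, w_d$, bias $b$, and output $y = f\left(\sum_{i=1}^{d} w_i x_i + b\right)$
$$f(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{if } x \ge 0 \end{cases}$$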

Why Bias Term?
Source: Tommy Huang 線性迴歸

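Classification Problem
Points in the $(x_1, x_2)$ plane: (0,0), (0,1), (1,1), (1,0)
Decision boundary: $x_1 + x_2 - 0.5 = 0$
Perceptron: inputs $x_1$ and $x_2$ with weights 1 and 1, bias $-0.5$, output $y$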
XOR Problem
Points in the $(x_1, x_2)$ plane: (0,0), (0,1), (1,1), (1,0)
Multilayer Perceptron (MLP)
[Diagram: an input layer, a hidden layer, and an output layer, each with neurons 1, 2, 3; labels: input, output, neuron, connection]
Common Activation Functions
Sigmoid: $f(x) = \dfrac{1}{1 + e^{-x}}$
Tanh: $f(x) = \dfrac{1 - e^{-2x}}{1 + e^{-2x}}$
Rectified Linear Unit (ReLU): $f(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \ge 0 \end{cases}$
Leaky ReLU
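These activation functions translate almost directly into NumPy. The slide gives no formula for Leaky ReLU, so the negative-side slope `alpha=0.01` below is an assumed, commonly used default rather than something from the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Same form as on the slide; equivalent to np.tanh
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is an assumed slope for x < 0
    return np.where(x >= 0, x, alpha * x)

xs = np.array([-2.0, 0.0, 3.0])
print(sigmoid(xs))
print(tanh(xs))
print(relu(xs))
print(leaky_relu(xs))
```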

Warning: heavy material ahead
Please brace yourself

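Vector & Basis
$$\vec{v} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} = 3\hat{i} + 4\hat{j}, \qquad \hat{i} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \hat{j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \quad \text{(basis)}$$
$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 4 \end{pmatrix} = 3 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + 4 \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}$$
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = x \begin{pmatrix} a \\ c \end{pmatrix} + y \begin{pmatrix} b \\ d \end{pmatrix} = \begin{pmatrix} ax + by \\ cx + dy \end{pmatrix}$$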

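Matrix Representation
[Diagram: layer L1 (neurons 1, 2 with inputs $x_1$, $x_2$) fully connected to layer L2 (neurons 1, 2) through weights $w_{1,1}$, $w_{1,2}$, $w_{2,1}$, $w_{2,2}$]
$$y_1 = x_1 w_{1,1} + x_2 w_{1,2}$$
$$y_2 = x_1 w_{2,1} + x_2 w_{2,2}$$
$$\begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = x_1 \begin{bmatrix} w_{1,1} \\ w_{2,1} \end{bmatrix} + x_2 \begin{bmatrix} w_{1,2} \\ w_{2,2} \end{bmatrix} = \begin{bmatrix} x_1 w_{1,1} + x_2 w_{1,2} \\ x_1 w_{2,1} + x_2 w_{2,2} \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$
This follows the general pattern
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = x \begin{pmatrix} a \\ c \end{pmatrix} + y \begin{pmatrix} b \\ d \end{pmatrix} = \begin{pmatrix} ax + by \\ cx + dy \end{pmatrix}$$
In compact form: $y = wx$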
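The compact form $y = wx$ can be checked numerically against the column-combination view. The weight and input values below are arbitrary numbers chosen for illustration.

```python
import numpy as np

# Hypothetical weights: row i holds the weights feeding output y_i
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])

# Matrix-vector product: y = W x
y_matmul = W @ x

# Same result as a combination of the columns of W, scaled by x1 and x2
y_columns = x[0] * W[:, 0] + x[1] * W[:, 1]

print(y_matmul)   # [17. 39.]
print(y_columns)  # [17. 39.]
```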

Vector addition
import random
import numpy as np
r = [random.randrange(100) for _ in range(100_000)]  # 100,000 random integers from 0 to 99
n = 1_000
x, y = random.sample(r, n), random.sample(r, n)

For Loop
%%timeit
z = []
for i in range(n):
    z.append(x[i] + y[i])

While Loop
%%timeit
i, z = 0, []
while i < n:
    z.append(x[i] + y[i])
    i += 1

NumPy
x_, y_ = np.array(x, dtype=np.int64), np.array(y, dtype=np.int64)
%%timeit
z = x_ + y_
147 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

NumPy
%%timeit
z = x_ + y_
46.6 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Matrix addition
m, n = 100, 1_000  # simulate a 100x1000 matrix
x = [random.sample(r, n) for _ in range(m)]
y = [random.sample(r, n) for _ in range(m)]

For Loop
%%timeit
z = []
for i in range(m):
    z_ = []
    for j in range(n):
        z_.append(x[i][j] + y[i][j])
    z.append(z_)
31.4 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

While Loop
%%timeit
i, z = 0, []
while i < m:
    j, z_ = 0, []
    while j < n:
        z_.append(x[i][j] + y[i][j])
        j += 1
    z.append(z_)
    i += 1
35.8 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

NumPy
# Convert the nested lists to arrays first (same conversion as in the vector example)
x_, y_ = np.array(x, dtype=np.int64), np.array(y, dtype=np.int64)
%%timeit
z = x_ + y_
46.6 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Vectorization
[Diagram: layer L1 (neurons 1, 2 with inputs $x_1$, $x_2$) connected to neuron 1 of layer L2 through weights $w_{1,1}$ and $w_{1,2}$]
$y_1 = x_1 w_{1,1} + x_2 w_{1,2}$

import numpy as np
# define two arrays a, b:
a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

NumPy
%%timeit
c = np.dot(a, b)  # inner product
1.09 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print(c)
249581.03910735418

For Loop
%%timeit
c = 0
for i in range(1_000_000):
    c += a[i] * b[i]
781 ms ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(c)
249581.03910735913

Vector addition
import numpy as np
import torch
import random
r = [random.randrange(100) for _ in range(100_000_000)]  # 100 million random integers from 0 to 99
n = 1_000_000
x, y = random.sample(r, n), random.sample(r, n)
# Convert the lists to arrays (same conversion as in the earlier vector example)
x_, y_ = np.array(x, dtype=np.int64), np.array(y, dtype=np.int64)

NumPy
%%timeit
z = x_ + y_

PyTorch
tensor_x = torch.from_numpy(x_)
tensor_y = torch.from_numpy(y_)
device = torch.device('cuda')
x = tensor_x.to(device)
y = tensor_y.to(device)
%%timeit
z = x + y
61.7 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Axis
import numpy as np
a = np.array([[0, 1], [2, 3], [4, 5]])
print(a)
[[0 1]
 [2 3]
 [4 5]]
print(a.shape)
(3, 2)
axis=0: the outer level (the three rows)
axis=1: the inner level (the two entries within each row)

3-axis array
b = np.array([a, a])
print(b)
[[[0 1]
  [2 3]
  [4 5]]

 [[0 1]
  [2 3]
  [4 5]]]
print(b.shape)
(2, 3, 2)
axis=0 selects one of the two 3x2 matrices, axis=1 selects a row, axis=2 selects an entry within a row

b.sum(axis=0)
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
b.sum(axis=1)
array([[6, 9],
       [6, 9]])
b.sum(axis=2)
array([[1, 5, 9],
       [1, 5, 9]])

Broadcasting
import numpy as np
A = np.array(
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]])
print("A:\n", A)
print("\nA*2:\n", A * 2)    # multiply by 2
print("\nA+10:\n", A + 10)  # add 10
B = np.array([[10],
              [100],
              [1000]])
print("\nB:\n", B)
print("\nA+B:\n", A + B)
print("\nA*B:\n", A * B)

A:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

A*2:
 [[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]

A+10:
 [[11 12 13]
 [14 15 16]
 [17 18 19]]

B:
 [[  10]
 [ 100]
 [1000]]

A+B:
 [[  11   12   13]
 [ 104  105  106]
 [1007 1008 1009]]

A*B:
 [[  10   20   30]
 [ 400  500  600]
 [7000 8000 9000]]

ReLU Revisited
$$f(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \ge 0 \end{cases} \qquad \text{that is, } y = \max(0, x)$$
X = np.array([[1, -2, 3, -4],
              [-9, 4, 5, 6]])
Y = np.maximum(0, X)
print(Y)
[[1 0 3 0]
 [0 4 5 6]]

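Differential
[Figure: a secant line on the graph of $f(x)$ with $\Delta x = 4$ and $\Delta y = -4.4$; axis ticks at 1.8 and -2]
$$\text{slope} = \frac{\Delta y}{\Delta x} = \frac{-4.4}{4} = -1.1$$
$$y' = f'(x) = f' = \frac{dy}{dx} = \frac{d}{dx} f(x) = \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$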
Source: http://calculus.nctu.edu.tw/
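For $z = f(x, y)$:
$$f_x(x, y) = \frac{\partial}{\partial x} f(x, y) = z_x = \frac{\partial z}{\partial x}$$
$$f_y(x, y) = \frac{\partial}{\partial y} f(x, y) = z_y = \frac{\partial z}{\partial y}$$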

Linear Regression
Source: GIPHY

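Derivative of Activation Functions
Sigmoid: $f(x) = \dfrac{1}{1 + e^{-x}}$
Tanh: $f(x) = \dfrac{1 - e^{-2x}}{1 + e^{-2x}}$
Rectified Linear Unit (ReLU): $f(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \ge 0 \end{cases}$
[Figure: each function shown together with its derivative]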
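The derivative plots did not survive extraction, so the sketch below writes out the standard closed forms and compares them with central differences. Treating the ReLU derivative as 1 at exactly $x = 0$ is a convention assumed here (the true derivative is undefined at that point).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # sigma'(x) = sigma(x) * (1 - sigma(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2      # tanh'(x) = 1 - tanh(x)^2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return np.where(x >= 0, 1.0, 0.0)  # assumed convention: slope 1 at x = 0

def central_difference(f, x, h=1e-5):
    """Numerical derivative used only to sanity-check the formulas."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = np.array([-2.0, -0.5, 0.5, 2.0])   # avoid x = 0, where ReLU has a kink
print(d_sigmoid(xs), central_difference(sigmoid, xs))
print(d_tanh(xs), central_difference(np.tanh, xs))
print(d_relu(xs), central_difference(relu, xs))
```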

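New product tasting problem
• We held a tasting for a new product and invited ten colleagues to score it out of 100
• Everyone gave a good score of 90 or above
• The scores were written down on paper and never entered into a computer
• Then the manager asks: so what is the average score?

Taster  Satisfaction
A       98
B       99
C       96
D       91
E       99
F       91
G       94
H       93
I       99
J       90

$$\frac{98 + 99 + 96 + 91 + 99 + 91 + 94 + 93 + 99 + 90}{10} = 95$$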

Taster  Satisfaction  Baseline  Deviation
A       98            90        8
B       99            90        9
C       96            90        6
D       91            90        1
E       99            90        9
F       91            90        1
G       94            90        4
H       93            90        3
I       99            90        9
J       90            90        0

$$90 + \frac{8 + 9 + 6 + 1 + 9 + 1 + 4 + 3 + 9 + 0}{10} = 90 + \frac{50}{10} = 95$$

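Height problem
• Our department has several men and several women
• We are ordering custom T-shirts
• We want to know how many sizes to prepare for the men and for the women (the spread of heights)

Man  Height      Woman  Height
A    171         A      160
B    176         B      158
C    182         C      157
D    165         D      152
E    170         E      150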

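Variance & Standard Deviation
Variance is a statistic that measures how spread out the data is
[Chart: heights from 140 to 190 for the five men (171, 176, 182, 165, 170) and the five women (160, 158, 157, 152, 150)]

Women's height  Mean (rounded)  Deviation
160             155             +5
158             155             +3
157             155             +2
152             155             -3
150             155             -5

Sum of deviations: $5 + 3 + 2 + (-3) + (-5) = 2$ (not exactly zero, because the exact mean is 155.4)
Sum of squared deviations: $5^2 + 3^2 + 2^2 + (-3)^2 + (-5)^2 = 72$
$$\frac{5^2 + 3^2 + 2^2 + (-3)^2 + (-5)^2}{5} = 14.4$$
Standard deviation:
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}, \quad \text{where } \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Variance:
$$\mathrm{Var}(X) = \sigma^2 = E\left[(X - \mu)^2\right] = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$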
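A short check of these definitions on the height data from the slides. Note that the slide works with the mean rounded to 155; using the exact mean of 155.4 gives a slightly smaller variance, and both are shown. The helper name `variance` is illustrative.

```python
import numpy as np

female = np.array([160, 158, 157, 152, 150])
male = np.array([171, 176, 182, 165, 170])

def variance(x):
    """Population variance: mean squared deviation from the mean."""
    mu = x.mean()
    return np.mean((x - mu) ** 2)

# Slide's calculation, using the rounded mean 155
var_rounded_mean = np.mean((female - 155) ** 2)

print(var_rounded_mean)                            # 14.4, as on the slide
print(variance(female), np.sqrt(variance(female))) # exact-mean variance and standard deviation
print(variance(male), np.sqrt(variance(male)))     # the men's heights are more spread out
```

The larger variance for the men suggests preparing more sizes for them than for the women.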

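Feature Scaling (Normalization)
Standardization:
$$z = \frac{x - \mu}{\sigma}, \quad \text{where } \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2} \text{ and } \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Min-max scaling:
$$X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$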
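Both scalings are one-liners in NumPy. The sketch reuses the men's heights from the earlier slide as sample data; the function names are illustrative.

```python
import numpy as np

def standardize(x):
    """z = (x - mean) / standard deviation."""
    return (x - x.mean()) / x.std()

def min_max_scale(x):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

heights = np.array([171.0, 176.0, 182.0, 165.0, 170.0])

z = standardize(heights)
scaled = min_max_scale(heights)

print(z, z.mean(), z.std())               # mean about 0, standard deviation about 1
print(scaled, scaled.min(), scaled.max()) # values between 0 and 1
```

When scaling real datasets, the mean, standard deviation, minimum and maximum are normally computed on the training set only and then reused for validation and test data.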

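Sausage stall gambling problem
• Suppose there is a dice game in which one fair die is rolled per round
• If you roll a 1 or a 2, the stall owner pays you one sausage
• If you roll any other number, you pay the owner the price of two sausages
• Would you play this game?
• What if rolling a 1 or a 2 won you three sausages instead: would you play then?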

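Lottery ticket value problem
• 迎棧科技 issues 10,000 lottery tickets, each carrying a four-digit number from 0000 to 9999
• First prize matches all four digits, second prize the last three digits, third prize the last two digits, and fourth prize the last digit
• A ticket cannot win more than one prize

Prize         Winners  Prize money
First prize   1        50,000
Second prize  9        10,000
Third prize   90       2,000
Fourth prize  900      500
No prize      9000     0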

Buying every ticket
$$50000 \times 1 + 10000 \times 9 + 2000 \times 90 + 500 \times 900 = 770000$$
$$\frac{770000}{10000} = 77$$
This is the value of one ticket

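$$77 = \frac{770000}{10000} = \frac{50000 \times 1 + 10000 \times 9 + 2000 \times 90 + 500 \times 900}{10000}$$
$$= 50000 \times \frac{1}{10000} + 10000 \times \frac{9}{10000} + 2000 \times \frac{90}{10000} + 500 \times \frac{900}{10000}$$
In the first term, 50000 is the first-prize payoff and $\frac{1}{10000}$ is the first-prize probability
The average value is the expected value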
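The same expected value can be computed as a probability-weighted sum over the prize table. A small sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# (number of winning tickets, prize money) for each tier, from the table
prizes = [(1, 50_000), (9, 10_000), (90, 2_000), (900, 500), (9_000, 0)]

total_tickets = sum(count for count, _ in prizes)

# Expected value = sum over tiers of P(tier) * payoff(tier)
expected_value = sum(
    Fraction(count, total_tickets) * amount for count, amount in prizes
)

print(total_tickets)    # 10000
print(expected_value)   # 77
```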

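Optimal strategy problem 1
• A bag holds balls of four colors
• A ball is drawn, and you may ask one question at a time about it
• Goal: the minimum number of questions that guarantees the right answer
The best strategy takes 2 questions
Decision tree: "Is it blue or red?" If yes, ask "Is it blue?" (Y/N); if no, ask "Is it green?" (Y/N)
The expected value is 2:
$$\frac{1}{4} \times 2 + \frac{1}{4} \times 2 + \frac{1}{4} \times 2 + \frac{1}{4} \times 2 = 2$$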

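• Same rules as problem 1, but the probabilities have changed: blue $\frac{1}{2}$, red $\frac{1}{4}$, green $\frac{1}{8}$, orange $\frac{1}{8}$
Decision tree: "Is it blue?" If no, ask "Is it red?" If no again, ask "Is it green?"
• A blue ball is settled with one question
• A red ball takes two questions
• Green and orange balls take three questions
Expected value:
$$\frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{8} \times 3 + \frac{1}{8} \times 3 = 1.75$$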

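• Same rules as problem 1, but the bag holds only blue balls
  Questions needed: $\log_2 1 = 0$
• Generalizing from the problems above: if a color appears with probability $p$, the number of questions needed to guess it is $\log_2 \frac{1}{p}$
• For example, in problem 2 the red ball has probability $\frac{1}{4}$, so it takes $\log_2 4 = 2$ questions
• The number of questions needed for the whole problem is the expected value:
$$\sum_i p_i \times \log_2 \frac{1}{p_i} = -\sum_i p_i \times \log_2 p_i$$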
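The expected number of questions for all three problems follows from this one formula. A minimal sketch; the function name `entropy` is chosen to match the term introduced on the next slide.

```python
import math

def entropy(probs):
    """Expected number of yes/no questions: sum of p * log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

problem_1 = [1/4, 1/4, 1/4, 1/4]   # four equally likely colors
problem_2 = [1/2, 1/4, 1/8, 1/8]   # blue, red, green, orange
problem_3 = [1.0]                  # only blue balls

print(entropy(problem_1))  # 2.0
print(entropy(problem_2))  # 1.75
print(entropy(problem_3))  # 0.0
```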

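(Shannon) Entropy vs Randomness
Entropy is maximum at maximum randomness
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$
For two equally likely outcomes with $b = 2$:
$$H(X) = -\sum_{i=1}^{2} \frac{1}{2} \log_2 \frac{1}{2} = -\sum_{i=1}^{2} \frac{1}{2} \cdot (-1) = 1$$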

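Problem 2 using strategy 1
• Same rules as problem 1, but with probabilities $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$, $\frac{1}{8}$
Decision tree (strategy 1): "Is it blue or red?" If yes, ask "Is it blue?" (Y/N); if no, ask "Is it green?" (Y/N)
Expected value:
$$\frac{1}{8} \times 2 + \frac{1}{8} \times 2 + \frac{1}{4} \times 2 + \frac{1}{2} \times 2 = 2$$
Given a strategy, cross entropy is the expected number of questions needed to guess the color under that strategy
This strategy is worse (2 questions instead of 1.75)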
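Mathematical definition
Entropy:
$$\sum_i p_i \times \log_2 \frac{1}{p_i} = -\sum_i p_i \times \log_2 p_i$$
Cross entropy:
$$-\sum_i p_i \times \log_2 \hat{p}_i$$
where $p_i$ is the true probability and $\hat{p}_i$ is the (wrongly) assumed probability
Categorical cross entropy cost function:
$$L(y, \hat{y}) = -\sum_{j=0}^{M} \sum_{i=0}^{N} y_{ij} \log(\hat{y}_{ij})$$
network.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])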

Writing a digit classifier in six lines of Keras

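MNIST Dataset
• The MNIST dataset comes from the National Institute of Standards and Technology (NIST)
• The training samples are digits handwritten by 250 different people, 50% of them high school students and 50% Census Bureau employees. The test samples are handwritten digits in the same proportion
• 60,000 training samples
• 10,000 test samples
• The digits have been size-normalized and centered in a fixed-size image (28x28 pixels)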

Source: Neural Network 3D Simulation

Source: https://keras.io/layers/core/

One-hot Encoding
train_labels = tf.keras.utils.to_categorical(train_labels)

Color    Red  Yellow  Green
Red      1    0       0
Red      1    0       0
Yellow   0    1       0
Green    0    0       1
Yellow   0    1       0
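The table above can be reproduced with plain NumPy to show what a one-hot encoder does. The helper `one_hot` is written for this illustration and is not the Keras implementation; the class order Red, Yellow, Green follows the table's columns.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 per row."""
    encoded = np.zeros((len(labels), num_classes), dtype=int)
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

classes = ["Red", "Yellow", "Green"]                 # column order of the table
colors = ["Red", "Red", "Yellow", "Green", "Yellow"]

# Map each color name to its integer class index, then encode
labels = np.array([classes.index(c) for c in colors])
encoded = one_hot(labels, len(classes))
print(encoded)
```

For MNIST the labels are already the integers 0 to 9, so each label becomes a row of ten entries that matches the 10-way softmax output.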

• Input/Output Representation
• Network Architecture
• Activation Functions
• Weight Initialization
• Optimizer
• Loss/Cost Function
• Evaluation Metrics
Source: CS230 - Andrew Ng
Topics left out of this talk
