Fundamentals of Neural Networks --
MLP(Multi-Layer Perceptron)
Artificial neural network
Optimizer
Mini-batch
Activation functions
Loss functions
Batch Normalization
Avoid Overfitting: Weight Decay, Dropout
MLP(Multi-Layer Perceptron)
A single neuron (perceptron)
AND, OR gates using one Perceptron
[Figure: a single perceptron F(x1, x2); the AND gate uses weights 0.5, 0.5 and bias -0.7, the OR gate uses weights 1, 1 and bias -0.5.]

X1 X2 Y (AND)
--------------------
0  0  0
0  1  0
1  0  0
1  1  1

X1 X2 Y (OR)
--------------------
0  0  0
0  1  1
1  0  1
1  1  1
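A minimal sketch in plain Python of the two gates above, using the weights from the figure (0.5, 0.5, -0.7 for AND; 1, 1, -0.5 for OR); the strict threshold `> 0` is an assumption about how the step is taken.

```python
def perceptron(x1, x2, w1, w2, b):
    # Fire (output 1) when w1*x1 + w2*x2 + b > 0, otherwise output 0.
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def AND(x1, x2):
    return perceptron(x1, x2, 0.5, 0.5, -0.7)   # weights/bias from the slide

def OR(x1, x2):
    return perceptron(x1, x2, 1.0, 1.0, -0.5)   # weights/bias from the slide

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
# Reproduces both truth tables above.
```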
Quiz: XOR gate
[Figure: a single perceptron F(x1, x2) with unknown weights and bias, labeled "? ? ? XOR".]

X1 X2 Y (XOR)
--------------------
0  0  0
0  1  1
1  0  1
1  1  0

Can the XOR gate be implemented with only one Perceptron?!
Single Perceptron == linear
A single perceptron can only draw one linear decision boundary.
[Figure: decision boundaries in the (X1, X2) plane for the points (0,0), (1,0), (0,1), (1,1) --
AND: 0.5X1 + 0.5X2 - 0.7 = 0 (equivalently -0.5X1 - 0.5X2 + 0.7 = 0), crossing the X1 axis at (1.4, 0);
OR: X1 + X2 - 0.5 = 0, crossing the X1 axis at (0.5, 0);
XOR: no single line separates {(0,1), (1,0)} from {(0,0), (1,1)}.]
A two-layer construction is sketched below.
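One common way to answer the quiz, reusing the perceptron/AND/OR helpers from the earlier sketch: stack the units so that XOR(x1, x2) = AND(NAND(x1, x2), OR(x1, x2)). The NAND weights (-0.5, -0.5, +0.7) are assumed (a sign-flipped AND unit; its boundary coincides with the -0.5X1 - 0.5X2 + 0.7 = 0 line above).

```python
def NAND(x1, x2):
    # Sign-flipped AND unit (assumed weights); boundary -0.5*x1 - 0.5*x2 + 0.7 = 0
    return perceptron(x1, x2, -0.5, -0.5, 0.7)

def XOR(x1, x2):
    # Two layers of linear units: first NAND and OR, then AND on their outputs.
    return AND(NAND(x1, x2), OR(x1, x2))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "XOR:", XOR(x1, x2))   # prints 0, 1, 1, 0
```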
Lab: keras/concept/NN_concept.ipynb
Feature transformation
Adding hidden layers transforms the features, allowing the network to handle harder classification problems.
https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applications-of-artificial-neural-networks-in-chemical-problems
Cascading more perceptrons → Neural Network
[Figure: a single layer with inputs x1, x2 and outputs a1, a2, a3, connected by weights w11..w23 and biases b1, b2, b3.]

a1 = x1*w11 + x2*w21 + b1
a2 = x1*w12 + x2*w22 + b2
a3 = x1*w13 + x2*w23 + b3

In matrix form: A = XW + B, where
X = [x1 x2]            (1x2)
W = [w11 w12 w13
     w21 w22 w23]      (2x3)
B = [b1 b2 b3]         (1x3)
A = [a1 a2 a3]         (1x3)

A simple Function
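A minimal NumPy sketch of the single-layer computation A = XW + B; the numeric values are placeholders chosen only to show the shapes.

```python
import numpy as np

X = np.array([[1.0, 0.5]])             # input  (1x2): [x1, x2]
W = np.array([[0.1, 0.3, 0.5],         # weights (2x3): row i holds w_i1..w_i3
              [0.2, 0.4, 0.6]])
B = np.array([[0.1, 0.2, 0.3]])        # biases (1x3): [b1, b2, b3]

A = X @ W + B                          # (1x3): [a1, a2, a3]
print(A)                               # ≈ [[0.3 0.7 1.1]]
```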
Multi-layer neural network
MLP uses multiple hidden layers between the input
and output layers to extract meaningful features
A Neural Network = A Function
MLP(Multi-Layer Perceptron)
2-layer Neural Network
[Figure: a 2-layer network -- inputs x1, x2; hidden neurons a1^(1), a2^(1), a3^(1) with biases b1^(1), b2^(1), b3^(1); output neurons with biases b1^(2), b2^(2) giving y1, y2.]

Layer 1: A^(1) = X W^(1) + B^(1)
X     = [x1 x2]                       (1x2)
W^(1) = [w11^(1) w12^(1) w13^(1)
         w21^(1) w22^(1) w23^(1)]     (2x3)
B^(1) = [b1^(1) b2^(1) b3^(1)]        (1x3)
A^(1) = [a1^(1) a2^(1) a3^(1)]        (1x3)

Layer 2: A^(2) = A^(1) W^(2) + B^(2)
W^(2) = [w11^(2) w12^(2)
         w21^(2) w22^(2)
         w31^(2) w32^(2)]             (3x2)
B^(2) = [b1^(2) b2^(2)]               (1x2)
A^(2) = [y1 y2]                       (1x2)
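A sketch of the 2-layer forward pass above in NumPy. The sigmoid between the layers and the random parameter values are assumptions (activation functions are covered later); only the shapes follow the slide.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X  = np.array([[1.0, 0.5]])                          # 1x2 input [x1, x2]
W1 = rng.normal(size=(2, 3)); B1 = np.zeros((1, 3))  # layer 1: 2x3 weights, 1x3 bias
W2 = rng.normal(size=(3, 2)); B2 = np.zeros((1, 2))  # layer 2: 3x2 weights, 1x2 bias

A1 = X @ W1 + B1          # 1x3 hidden pre-activations
Z1 = sigmoid(A1)          # 1x3 hidden activations (assumed sigmoid)
A2 = Z1 @ W2 + B2         # 1x2 output
y1, y2 = A2[0]            # [y1, y2]
print(y1, y2)
```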
Training neural networks
Find the network weights that minimize the training error between the true and the estimated labels of the training examples, e.g. the sum of squared errors E(w) = Σ_j (y_j - f_w(x_j))^2
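A tiny sketch of that training error as a sum of squared differences; the three example values are placeholders.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])       # true labels y_j
y_pred = np.array([0.8, 0.2, 0.6])       # network outputs f_w(x_j)
E = np.sum((y_true - y_pred) ** 2)       # training error E(w)
print(E)                                 # ≈ 0.24
```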
Training of multi-layer networks
Back-propagation: gradients are computed in the direction from the output to the input layers and combined using the chain rule
SGD (Stochastic gradient descent): compute the weight update w.r.t. one training example at a time, cycling through the training examples in random order over multiple epochs → slow convergence
(randomly picking one sample and updating on one example at a time is slow)
• mini-batch SGD (a batch of samples computed simultaneously) → faster to complete one epoch (see the sketch below)
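A NumPy sketch of the mini-batch loop on a linear model, with the squared-error gradient written out by hand; the dataset, batch size, and learning rate are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                     # 1000 training examples
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)       # noisy linear targets

w, lr, batch_size = np.zeros(2), 0.1, 100

for epoch in range(10):
    order = rng.permutation(len(X))                # shuffle every epoch
    for start in range(0, len(X), batch_size):     # 1000/100 = 10 updates per epoch
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)  # gradient of the mean squared error
        w -= lr * grad                             # one mini-batch update
print(w)                                           # close to [2, -1]
```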
Optimizer
Mini-batch training is expected to be run several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning.
This is especially useful when the whole dataset is too big to fit in memory at once.
Mini-batch vs. Epoch
* One epoch = one full pass over all the training data
* The training data is split into chunks according to the mini-batch size
  Suppose there are 1000 samples in total:
  batch size = 100 → 10 chunks → 10 weight updates per epoch
  batch size = 10 → 100 chunks → 100 weight updates per epoch
* How should the batch size be set?
  Do not set it too large; commonly used values are 28, 32, 128, 256, … (see the Keras sketch below)
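A Keras sketch of that bookkeeping (the tiny model and random data are placeholders; only batch_size matters here): with 1000 samples and batch_size=100, fit() performs 10 weight updates per epoch.

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 2)
y = np.random.randint(0, 2, size=1000)

model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(3, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

model.fit(X, y, epochs=5, batch_size=100)  # 1000 / 100 = 10 steps (updates) per epoch
```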
mini-batch: partial fit method
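A scikit-learn sketch of partial_fit (assumed to be the method the heading refers to), feeding one chunk / mini-batch at a time; the random chunks stand in for data streamed from disk.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                       # linear model trained with SGD
classes = np.array([0, 1])                  # all classes must be given on the first call

for _ in range(10):                         # e.g. 10 chunks that do not fit in memory at once
    X_chunk = np.random.rand(100, 2)
    y_chunk = np.random.randint(0, 2, size=100)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```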
Overview of Neural Network
To avoid getting stuck in a local minimum and to further increase the training speed, use adaptive learning rate / gradient algorithms (see the sketch after the list):
Adaptive Learning Rate/Gradient algorithms
1. Adagrad
2. Momentum
3. RMSProp
4. Adam
5. …
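A Keras sketch for selecting the optimizers in the list above; the learning rates shown are the Keras defaults, and momentum=0.9 is just a common choice, not a recommendation.

```python
from tensorflow import keras

optimizers = {
    "SGD+Momentum": keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "Adagrad":      keras.optimizers.Adagrad(learning_rate=0.001),
    "RMSProp":      keras.optimizers.RMSprop(learning_rate=0.001),
    "Adam":         keras.optimizers.Adam(learning_rate=0.001),
}

# Pick one when compiling a model, e.g.:
# model.compile(optimizer=optimizers["Adam"], loss="mse")
```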
5. MLP(Multi-Layer Perceptron)