Digit Recognizer by Convolutional Neural Network (CNN)
Ding Li 2018.06
online store: costumejewelry1.com
2
MNIST database
(Modified National Institute of Standards and Technology database)
60,000 training images; 10,000 testing images
Kaggle Challenge
42,000 training images; 28,000 testing images
Input: handwritten digit image, 784 (28*28) pixel values, each coded in [0, 255]
Result to predict: label in [0, 9]
https://en.wikipedia.org/wiki/MNIST_database
Figure: sample handwritten digits, each 28*28 pixels.
3
Input: 28*28
Conv 1: 3*3, 16 filters → 26*26*16
Conv 2: 3*3, 16 filters → 24*24*16
Max pool 1: 2*2 → 12*12*16
Conv 3: 3*3, 32 filters → 10*10*32
Conv 4: 3*3, 32 filters → 8*8*32
Max pool 2: 2*2 → 4*4*32
Flatten → 512
Dense 1 (fully connected, ReLU): 512 → 512
Dense 2 (fully connected, ReLU): 512 → 1024
Dense 3 (fully connected, softmax): 1024 → 10 probabilities, one per digit 0 to 9
Predicted label [0,9]: the digit with the maximum probability
Trainable parameters: 814,778
Well-known CNN architectures: LeNet-5, AlexNet, VGG 16
https://en.wikipedia.org/wiki/Convolutional_neural_network
https://en.wikipedia.org/wiki/Yann_LeCun
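The layer sizes on this slide can be checked with a little arithmetic. The sketch below assumes unpadded 3*3 convolutions with stride 1, non-overlapping 2*2 pooling, and the layer order implied by the output shapes (two convolutions, then a pool, twice). The helper names are illustrative and are not taken from the project code.

```python
def conv_side(side, kernel=3):
    """Side length after an unpadded, stride-1 convolution."""
    return side - kernel + 1


def pool_side(side, pool=2):
    """Side length after non-overlapping pooling."""
    return side // pool


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(n_in, n_out):
    """Weights plus one bias per output node."""
    return n_in * n_out + n_out


# (layer kind, number of filters or pool size); the order is an assumption
LAYERS = [("conv", 16), ("conv", 16), ("pool", 2),
          ("conv", 32), ("conv", 32), ("pool", 2)]

side, channels, params = 28, 1, 0
shapes = []
for kind, n in LAYERS:
    if kind == "conv":
        params += conv_params(3, channels, n)
        side, channels = conv_side(side), n
    else:
        side = pool_side(side, n)
    shapes.append((side, side, channels))

flattened = side * side * channels
n_in = flattened
for n_out in (512, 1024, 10):  # Dense 1, Dense 2, Dense 3
    params += dense_params(n_in, n_out)
    n_in = n_out

print(shapes)
print(flattened, params)
```

Under these assumptions the shapes match the slide and the count comes to 814,586, which is 192 short of the 814,778 reported. A gap of 192 = 2 * (16 + 16 + 32 + 32) would be consistent with a trainable scale and shift per convolution channel, as batch normalization adds, but that is an inference and not something the slides state.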
4
Convolution with a filter (kernel), also called a feature detector

Input (6*6):
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0

* filter (3*3):
1 0 -1
1 0 -1
1 0 -1

= Output (4*4):
0 30 30 0
0 30 30 0
0 30 30 0
0 30 30 0

One output value: the 3*3 window
10 10 0
10 10 0
10 10 0
is multiplied element-wise by the filter, giving
10 0 0
10 0 0
10 0 0
and ∑∑ = 30.

Output 30: detected edge. Output 0: no edge.
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
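The example on this slide can be reproduced in a few lines. This is a minimal sketch in plain Python, assuming no padding and a stride of 1. The function name `convolve2d` is illustrative. Like most deep learning libraries, it slides the filter without flipping it.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 6*6 input from the slide: bright left half, dark right half
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
# 3*3 filter from the slide
vertical_edge = [[1, 0, -1] for _ in range(3)]

output = convolve2d(image, vertical_edge)
for row in output:
    print(row)
```

Each output row is 0 30 30 0: the nonzero columns mark where the windows straddle the boundary between bright and dark pixels.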
5
Filter responding to vertical edges:
1 0 -1
1 0 -1
1 0 -1

Filter responding to horizontal edges:
1 1 1
0 0 0
-1 -1 -1

Multiple convolutional filters (kernels) can capture different changes among neighboring pixels.
https://aishack.in/tutorials/image-convolution-examples/
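To see that the two 3*3 filters respond to different pixel changes, the sketch below applies both to two hypothetical 6*6 test images, one with a vertical boundary and one with a horizontal boundary. The images and function name are illustrative assumptions.

```python
def convolve2d(image, kernel):
    """Unpadded, stride-1 sliding-window sum of element-wise products."""
    k = len(kernel)
    n = len(image) - k + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(k) for j in range(k))
         for c in range(n)]
        for r in range(n)
    ]


vertical_filter = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
horizontal_filter = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

# Hypothetical test images: bright left half, and bright top half
left_bright = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
top_bright = [[10] * 6 for _ in range(3)] + [[0] * 6 for _ in range(3)]

v_on_left = convolve2d(left_bright, vertical_filter)    # responds
h_on_left = convolve2d(left_bright, horizontal_filter)  # all zeros
v_on_top = convolve2d(top_bright, vertical_filter)      # all zeros
h_on_top = convolve2d(top_bright, horizontal_filter)    # responds
```

Each filter gives a nonzero response only for the boundary orientation it is built for, which is why a layer uses many filters side by side.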
6
Max pool, size 2 x 2

Input (4*4):
3 8 2 1
7 9 1 1
4 5 2 3
5 6 1 2

Output (2*2):
9 2
6 3

Pooling is applied after convolutional layers to reduce noise and data size.
Local information is concentrated into higher-level information.
Besides max pooling, there is also average pooling.
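The pooling example can be sketched as follows, assuming non-overlapping 2 x 2 blocks. The function name `pool2d` is illustrative. Passing a different reducer gives the average pool mentioned on the slide.

```python
def pool2d(image, size=2, reduce=max):
    """Apply `reduce` to each non-overlapping size*size block of the image."""
    return [
        [
            reduce([image[r + i][c + j]
                    for i in range(size) for j in range(size)])
            for c in range(0, len(image[0]) - size + 1, size)
        ]
        for r in range(0, len(image) - size + 1, size)
    ]


def mean(values):
    return sum(values) / len(values)


# 4*4 input from the slide
image = [
    [3, 8, 2, 1],
    [7, 9, 1, 1],
    [4, 5, 2, 3],
    [5, 6, 1, 2],
]

max_pooled = pool2d(image)               # [[9, 2], [6, 3]]
avg_pooled = pool2d(image, reduce=mean)  # [[6.75, 1.25], [5.0, 2.0]]
```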
7
Input: x0, x1, x2
Output: a
Parameters (to be optimized): w0, w1, w2, b
$z = w_0 x_0 + w_1 x_1 + w_2 x_2 + b$
$a = f(z)$, where f is the activation function
The nonlinearity increases certainty: the output tends toward "off" or "on".
ReLU: $f(z) = \max(0, z)$
Sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$
http://cs231n.github.io/neural-networks-1/
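A single node with the two activation functions can be written directly from the formulas on this slide. The input values, weights and bias below are made up for illustration.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, f):
    """z = w0*x0 + w1*x1 + ... + b, then a = f(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)


# Hypothetical inputs and parameters; here z = 0.5 - 2.0 + 0.75 + 0.25 = -0.5
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.25

a_relu = neuron(x, w, b, relu)        # negative z is switched "off": 0.0
a_sigmoid = neuron(x, w, b, sigmoid)  # squashed into (0, 1)
```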
8
Fully Connected Layers
All the inputs from the previous layer are combined at each node.
$a_0^{[1]} = f(w_{0,0}^{[1]} x_0 + w_{1,0}^{[1]} x_1 + w_{2,0}^{[1]} x_2 + w_{3,0}^{[1]} x_3 + b_0^{[1]})$
$a_1^{[1]} = f(w_{0,1}^{[1]} x_0 + w_{1,1}^{[1]} x_1 + w_{2,1}^{[1]} x_2 + w_{3,1}^{[1]} x_3 + b_1^{[1]})$
…...
Figure: inputs x0 to x3 feed three fully connected layers with nodes $a_0^{[l]}$ to $a_5^{[l]}$, for l = 1, 2, 3.
All the local features extracted in previous layers are fully connected with different weights to construct global features.
Complicated relationships between inputs can be revealed by deep networks.
https://github.com/drewnoff/spark-notebook-ml-labs/tree/master/labs/DLFramework
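The node equations above amount to one weighted sum per node. The sketch below stores the weights as `W[i][j]`, the weight from input i to node j, to match the subscripts $w_{i,j}$. The layer sizes and numbers are hypothetical and far smaller than the network in the slides.

```python
def relu(z):
    return max(0.0, z)


def dense(x, W, b, f):
    """Fully connected layer: every node j combines all inputs x[i]."""
    return [
        f(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
        for j in range(len(b))
    ]


# Hypothetical layer 1: four inputs, two nodes
x = [1.0, 2.0, 0.0, -1.0]
W1 = [[1.0, -1.0],
      [0.5, 0.5],
      [2.0, 2.0],
      [1.0, 1.0]]
b1 = [0.0, 1.0]
a1 = dense(x, W1, b1, relu)   # node 0: z = 1.0, node 1: z = 0.0

# Hypothetical layer 2: two inputs, one node, stacked on layer 1
W2 = [[2.0], [3.0]]
b2 = [-0.5]
a2 = dense(a1, W2, b2, relu)  # z = 2.0 * 1.0 + 3.0 * 0.0 - 0.5 = 1.5
```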
9
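Figure: the 1024 nodes $a_0^{[L-1]}, a_1^{[L-1]}, a_2^{[L-1]}, \dots, a_{1023}^{[L-1]}$ of layer L-1 are fully connected, through the weights $w_{0,0}^{[L]}, \dots, w_{1023,0}^{[L]}$, to the output nodes 0, 1, 2, ..., 9 of layer L.

Linear combination of inputs from the previous layer:
$z_0 = w_{0,0}^{[L]} a_0^{[L-1]} + w_{1,0}^{[L]} a_1^{[L-1]} + w_{2,0}^{[L]} a_2^{[L-1]} + \dots + w_{1023,0}^{[L]} a_{1023}^{[L-1]} + b_0^{[L]}$

Softmax, normalize the result:
$\hat{y}_0 = \frac{e^{z_0}}{e^{z_0} + e^{z_1} + e^{z_2} + \dots + e^{z_9}}$ (probability that y = 0)
$\hat{y}_0 + \hat{y}_1 + \hat{y}_2 + \dots + \hat{y}_9 = 1$

Prediction: $\hat{y} = [\hat{y}_0, \hat{y}_1, \hat{y}_2, \dots, \hat{y}_9]$
True value: $y = [y_0, y_1, y_2, \dots, y_9]$
E.g. y = 0:
$\hat{y} = [\mathbf{0.9}, 0.02, 0.01, 0.02, \dots, 0.04]$
$y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]$

Loss function: $L(\hat{y}, y) = -\sum_{i=0}^{9} y_i \log(\hat{y}_i)$
In the example: $L(\hat{y}, y) = -1 \cdot \log_{10} 0.9 = 0.046$ (with the natural logarithm the value is 0.105)
$L \ge 0$; at best match, $\hat{y} = y$ and $L(\hat{y}, y) = -1 \cdot \log 1 = 0$

Cost function: $J(w, b) = \frac{1}{m} \sum_{m} L(\hat{y}, y)$ (total loss of m samples, divided by m)
Goal: minimize the cost function.
A Friendly Introduction to Cross-Entropy Loss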
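Softmax and the cross-entropy loss can be sketched in plain Python. The slide elides the middle entries of the example prediction, so the vector below uses placeholder values (0.002 each) chosen only so that the probabilities sum to 1. They are assumptions, not values from the slides. Because the true label is one-hot, only the first entry affects the loss.

```python
import math


def softmax(z):
    """Normalize scores into probabilities that sum to 1."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_hat, y, log=math.log):
    """L = -sum_i y_i * log(y_hat_i); terms with y_i = 0 contribute nothing."""
    return -sum(t * log(p) for t, p in zip(y, y_hat) if t > 0)


probabilities = softmax([2.0, 1.0, 0.1])

# Example from the slide; the 0.002 entries are placeholders for the elided values
y_hat = [0.9, 0.02, 0.01, 0.02, 0.002, 0.002, 0.002, 0.002, 0.002, 0.04]
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

loss_base10 = cross_entropy(y_hat, y, log=math.log10)  # about 0.046, as on the slide
loss_natural = cross_entropy(y_hat, y)                 # about 0.105

perfect = cross_entropy(y, y)  # best match: -1 * log(1), which is zero
```

The base of the logarithm only rescales the loss, so it does not change which parameters minimize the cost.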
10
Figure: three consecutive fully connected layers. Layer L-2 (nodes $a_0^{[L-2]}$ to $a_{1023}^{[L-2]}$, outputs $A^{[L-2]}$) feeds layer L-1 (nodes $a_0^{[L-1]}$ to $a_{1023}^{[L-1]}$, outputs $A^{[L-1]}$) through $W^{[L-1]}$ and $b^{[L-1]}$. Layer L-1 feeds layer L (outputs $\hat{y}_0$ to $\hat{y}_9$ for the labels 0 to 9) through $W^{[L]}$ and $b^{[L]}$. Forward propagation runs toward layer L, backpropagation runs back from it.

Goal: minimize the difference between the predicted $\hat{y}$ and the true y.
1. With the initial parameters W and b, predict the label $\hat{y}$ with forward propagation and calculate the cost.
2. Optimize the parameters of layer L, $W^{[L]}$ and $b^{[L]}$, assuming the inputs from layer L-1, $A^{[L-1]}$, do not change.
3. Calculate the change of the inputs from layer L-1, $A^{[L-1]}$, that is needed to minimize the cost function, assuming the parameters $W^{[L]}$ and $b^{[L]}$ do not change.
4. Optimize the parameters of layer L-1, $W^{[L-1]}$ and $b^{[L-1]}$, from the desired changes of $A^{[L-1]}$.
5. Proceed like this all the way to the first layer, optimizing the parameters W and b of all layers.
6. Running forward propagation and backpropagation once over the whole training set is called one epoch. Run multiple epochs until the cost is near its minimum value.
https://en.wikipedia.org/wiki/Backpropagation
https://en.wikipedia.org/wiki/Geoffrey_Hinton
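Steps 1 to 3 can be sketched for a single softmax output layer. This is a toy illustration with plain gradient descent on one made-up sample, three inputs and three classes. The learning rate, the sizes and all numbers are assumptions, and the project itself may use a different optimizer. For softmax with the natural-log cross-entropy loss, the gradient with respect to z is simply $\hat{y} - y$.

```python
import math


def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(W, b, a_prev):
    """Step 1: W[i][k] is the weight from input i to output node k."""
    z = [sum(W[i][k] * a_prev[i] for i in range(len(a_prev))) + b[k]
         for k in range(len(b))]
    return softmax(z)


def loss(y_hat, y):
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)


def backward(W, a_prev, y_hat, y):
    dz = [p - t for p, t in zip(y_hat, y)]               # dL/dz
    dW = [[a_i * dz_k for dz_k in dz] for a_i in a_prev]  # step 2: dL/dW
    db = dz[:]                                            # step 2: dL/db
    da_prev = [sum(W[i][k] * dz[k] for k in range(len(dz)))
               for i in range(len(a_prev))]               # step 3: dL/dA[L-1]
    return dW, db, da_prev


# Hypothetical sample and initial parameters
a_prev = [0.5, -1.0, 2.0]
y = [0, 1, 0]
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [0.2, 0.1, -0.3]]
b = [0.0, 0.0, 0.0]
learning_rate = 0.1

losses = []
for _ in range(20):
    y_hat = forward(W, b, a_prev)
    losses.append(loss(y_hat, y))
    dW, db, da_prev = backward(W, a_prev, y_hat, y)
    W = [[W[i][k] - learning_rate * dW[i][k] for k in range(3)] for i in range(3)]
    b = [b[k] - learning_rate * db[k] for k in range(3)]
```

In a deeper network, `da_prev` is what gets passed back to layer L-1, where the same two steps repeat (steps 4 and 5).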
11
Dropout: a proportion of the nodes in a layer are randomly ignored for each training sample.
This technique forces the network to learn features in a distributed way and reduces overfitting.
Dropout is applied after the two pooling layers and the first two fully connected layers.
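A minimal sketch of dropout on one layer's outputs follows. It assumes the common "inverted" variant, in which the kept values are scaled by 1 / (1 - rate) during training so that nothing needs rescaling at prediction time. The slides do not say which variant or which rates were used, so the rate of 0.25 is only an example.

```python
import random


def dropout(values, rate, rng, training=True):
    """Zero each value with probability `rate`; rescale the ones that are kept."""
    if not training or rate == 0.0:
        return list(values)  # at prediction time every node is used
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 8

dropped = dropout(activations, 0.25, rng)
unchanged = dropout(activations, 0.25, rng, training=False)
```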
12
Some images are hard even for humans to recognize; more samples like these can help.
After 30 epochs of training (about 1 hour), the network achieved 99.67% accuracy.
Can you predict the true value?
14
Figure: input image, convolution matrices of the 16 filters (generated from training), and the resulting outputs.
15
Figure: outputs of successive layers, deeper and more abstract: 26*26*16, 24*24*16, 12*12*16, 10*10*32, 8*8*32, 4*4*32.
16
The machine combines the 1024 final values to judge the label [0,9].
Figure: from light signals to logical signals.
17
Can we understand the machine's 'mind'?
18
• Convolutional neural networks are very powerful for analyzing visual images.
• The convolutional layers capture local features.
• The pooling layers concentrate the local changes and reduce noise and data size.
• The fully connected layers combine all the local features to generate global features.
• The global features are combined to make the final judgement, here the probability of each label in [0,9].
• Can humans understand artificial neural networks?
• Is there any similarity between how the brain and a CNN process visual information?
• What is the meaning of the local and global features generated by machines?
• Can humans understand machines' logic?
Python code of the project at Kaggle: https://www.kaggle.com/dingli/digits-recognition-with-cnn-keras
