Deep Learning
Lecture (1)
19.10.22 You Sung Min
Bengio, Yoshua, Ian Goodfellow, and Aaron Courville. Deep Learning. Vol. 1. MIT Press, 2017.
0. Introduction
1. Why neural networks?
1. What is a neural network?
2. Universal approximation theorem
3. Why deep neural network?
2. How the network learns
1. Gradient descent
2. Backpropagation
3. Modern deep learning
1. Convolutional neural network
2. Recurrent neural network
Contents
Example of deep learning model
Introduction
Image source : Zeiler & Fergus, 2014
Artificial intelligence
Introduction
History of deep learning
Introduction
Timeline: Biological learning (1943) → Perceptron (1958) → Stochastic gradient descent (1960) → Neocognitron (1980) → Backpropagation, distributed representation (1986) → LSTM (1997) → Deep learning (2006)
History of deep learning
 Size of dataset
Introduction
History of deep learning
 Connections per neuron
Introduction
10: GoogLeNet (2014)
History of deep learning
 Number of neurons
Introduction
1: Perceptron … 20: GoogLeNet
Structure of perceptron (Developed in 1950s)
Why neural networks?
Binary inputs $x_j$, weights $\omega_j$, and threshold $T$:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j \omega_j x_j \le T \\ 1 & \text{if } \sum_j \omega_j x_j > T \end{cases}$$

equivalently, output 0 if $\sum_j \omega_j x_j - T \le 0$ or 1 if $\sum_j \omega_j x_j - T > 0$

With bias $b = -T$: $z = \sum_j \omega_j x_j + b$, output $y = \phi(z)$, where $\phi$ is called the activation ftn.

Output of a single neuron: $y = \phi\big(\sum_j \omega_j x_j + b\big)$
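A minimal NumPy sketch of this single neuron; the input, weights, and threshold values below are illustrative choices, not from the slides:

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: y = 1 if sum_j w_j * x_j + b > 0, else 0."""
    z = np.dot(w, x) + b           # weighted sum plus bias (b = -T)
    return 1 if z > 0 else 0

# Illustrative values: three binary inputs, hand-picked weights.
x = np.array([1, 0, 1])
w = np.array([0.6, 0.4, 0.3])
b = -0.5                           # bias, i.e., the negative threshold T
print(perceptron(x, w, b))         # -> 1, since 0.6 + 0.3 - 0.5 > 0
```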
Multilayer perceptron (MLP)
Why neural networks?
With $\omega^{(l)}$ and $b^{(l)}$ the weights and biases of layer $l$:

$y_j^{(1)} = \phi\big(\sum_i \omega_{ji}^{(1)} x_i + b_j^{(1)}\big)$

$y_j^{(2)} = \phi\big(\sum_i \omega_{ji}^{(2)} y_i^{(1)} + b_j^{(2)}\big)$

$y^{(3)} = \phi\big(\sum_i \omega_i^{(3)} y_i^{(2)} + b^{(3)}\big)$

Output of a network:

$F(x) = \phi\Big(\sum_i \omega_i^{(3)}\, \phi\big(\sum_k \omega_{ik}^{(2)}\, \phi\big(\sum_l \omega_{kl}^{(1)} x_l + b_k^{(1)}\big) + b_i^{(2)}\big) + b^{(3)}\Big)$
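A hedged NumPy sketch of this stacked forward pass; the layer sizes, random weights, and the sigmoid choice of $\phi$ are all illustrative assumptions:

```python
import numpy as np

def phi(z):
    """Sigmoid activation (one common choice for the activation ftn.)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]               # input, two hidden layers, output (illustrative)
Ws = [rng.normal(size=(m, n)) for m, n in zip(sizes[1:], sizes[:-1])]
bs = [rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    """F(x) = phi(W3 @ phi(W2 @ phi(W1 @ x + b1) + b2) + b3)."""
    y = x
    for W, b in zip(Ws, bs):
        y = phi(W @ y + b)         # one layer: affine map, then activation
    return y

print(forward(np.array([0.5, -1.0, 2.0])))
```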
Universal approximation theorem
⇒ On any compact subset of $\mathbb{R}^n$, any continuous function $f$ can be
approximated by a feedforward neural network
with at least one hidden layer
⇒ A neural network with a single hidden layer can approximate any
continuous multivariate function to any desired accuracy
Why neural networks?
$F(x) = \sum_{i=1}^{N} v_i\, \varphi\big(W_i^T x + b_i\big)$, where $\varphi: \mathbb{R} \to \mathbb{R}$ is a nonconstant, bounded, continuous function

$|F(x) - f(x)| < \epsilon$ for all $x$ in a compact subset of $\mathbb{R}^M$
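A rough numerical illustration of this form, assuming tanh for $\varphi$ and fitting only the output weights $v_i$ by least squares; the target $\sin(x)$, the unit count $N$, and the sampling ranges are arbitrary choices, not part of the theorem:

```python
import numpy as np

# Approximate f(x) = sin(x) with F(x) = sum_i v_i * phi(w_i * x + b_i).
# Hidden weights/biases are sampled at random and only v is fitted --
# a sketch of the theorem's statement, not a training recipe.
rng = np.random.default_rng(0)
N = 50                                   # number of hidden units (illustrative)
x = np.linspace(-np.pi, np.pi, 200)
f = np.sin(x)

w = rng.normal(scale=2.0, size=N)
b = rng.uniform(-np.pi, np.pi, size=N)
H = np.tanh(np.outer(x, w) + b)          # phi: bounded, nonconstant, continuous

v, *_ = np.linalg.lstsq(H, f, rcond=None)
F = H @ v
print(np.max(np.abs(F - f)))             # max |F - f| over the grid; typically small
```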
Universal approximation theorem
⇒ Regardless of what function we are trying to learn,
a large MLP will be able to represent that function
But it is not guaranteed that the training algorithm is able to
learn that function:
1. The optimization algorithm may fail to find the right parameters
(weights)
2. The training algorithm might choose the wrong function
due to overfitting (failing to generalize)
: There is no universal procedure to train and generalize
a function (no free lunch theorem; Wolpert, 1996)
Why neural networks?
Universal approximation theorem
⇒ A feedforward network with a single hidden layer is sufficient to
represent any function, but the layer may be infeasibly large and may
fail to learn and generalize correctly
 Why deep neural network?
In many cases, a deeper model can reduce both the required number
of units (neurons) and the generalization error
Why neural networks?
Why deep neural network?
Effect of depth (Goodfellow et al., 2014)
 Street View House Numbers (SVHN) database
Why neural networks?
[Figure: test accuracy vs. number of layers]
Goodfellow, Ian J., et al. "Multi-digit number recognition from street view imagery using
deep convolutional neural networks." arXiv preprint arXiv:1312.6082 (2013)
Why deep neural network?
Curse of dimensionality (→ a statistical challenge)
Let the dimension of the data space be d
and the number of samples required for inference be n
Generally, in practical tasks: $d \gg n$ (far more dimensions than samples)
Why neural networks?
Image source : Nicolas Chapados
[Figure: as the dimension grows (d = 10, 10², 10³), the number of required samples grows rapidly: n₁ < n₂ ≪ n₃]
Why deep neural network?
Local constancy prior (smoothness prior)
 For an input sample $x$ and a small perturbation $\epsilon$,
a well-trained function $f^*$ should satisfy
Why neural networks?
$f^*(x) \approx f^*(x + \epsilon)$
Why deep neural network?
Local constancy prior (smoothness prior)
Models with local kernels at the samples:
$O(k)$ samples are required to distinguish $O(k)$ regions
Deep learning spans the data into subspaces
(distributed representation)
The data are generated by a composition of factors (or
features), potentially at multiple levels in a hierarchy
Why neural networks?
Voronoi diagram
(nearest-neighbor regions)
Why deep neural network?
Manifold hypothesis
Manifold: a connected set of points that can be
approximated well by considering only a small
number of degrees of freedom (or dimensions)
embedded in a higher-dimensional space
Why neural networks?
Why deep neural network?
Manifold hypothesis
Real-world data (sound, images, text, etc.) are highly
concentrated in the data space
Why neural networks?
Random samples in the image space
Why deep neural network?
Manifold hypothesis
Even though the data space is $\mathbb{R}^n$, we don't have to
consider the whole space
We may consider only the neighborhoods of the observed
samples along certain manifolds
Transformations may exist along a manifold
(for example, intensity changes in images)
 Manifolds related to human faces and those related to cats
may differ
Why neural networks?
Why deep neural network?
Manifold hypothesis
Why neural networks?
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with
deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015)
Why deep neural network?
 Non-linear transform by learning
Linear model: a linear combination of the input $X$
⇒ a linear model with a non-linear transform $\phi(X)$ as
its input
Finding an optimal $\phi(X)$:
Previously: human-knowledge-based transforms
(i.e., handcrafted features)
Deep learning: the transform is learned inside the network
$y = f(x; \theta, \omega) = \phi(x; \theta)^T \omega$
Why neural networks?
Why deep neural network?
Why neural networks?
A hidden layer
$y = f(x; \theta, \omega) = \phi(x; \theta)^T \omega$
Why deep neural network?
Summary
Curse of dimensionality
Local constancy prior
Manifold hypothesis
Nonlinear transform by learning
The dimension of the data space can
be reduced to subsets of manifolds
The required decision regions
can be spanned by subspaces formed
as compositions of factors
Why neural networks?
Learning of the network
To approximate a function $f^*$:
Classifier: $y = f^*(x)$, where $y_i \in$ a finite set
Regression: $y = f^*(x)$, where $y_i \in \mathbb{R}^d$
 A network defines a mapping $y = f(x; \theta)$ and
learns the parameters $\theta$ that approximate the function $f^*$
Due to the non-linearity, global optimization
algorithms (such as convex optimization) are not suitable for
deep learning → iteratively update the parameters to reduce a cost function $C$:
Gradient descent
Backpropagation
How the network learns
Learning of the network
Gradient descent
How the network learns
[Figure: gradient descent on $f_1: \mathbb{R} \to \mathbb{R}$ and on $f_2: \mathbb{R}^n \to \mathbb{R}$]
Learning of the network
Directional derivative of $f$ in the direction of a unit vector $u$:
$\left.\frac{\partial}{\partial \alpha} f(v + \alpha u)\right|_{\alpha = 0} = u^T \nabla_v f(v)$
Minimizing over unit vectors $u$ gives $\cos\theta = -1$, i.e., $u$ pointing opposite the gradient
→ Moving toward the negative gradient decreases $f$
How the network learns
$v' = v - \eta \nabla_v f(v)$  ($\eta$: learning rate)
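A minimal sketch of this update rule on an assumed quadratic bowl; the function, learning rate, and iteration count are illustrative:

```python
import numpy as np

# Gradient descent v' = v - eta * grad f(v) on an illustrative quadratic.
def f(v):
    return v[0] ** 2 + 4.0 * v[1] ** 2

def grad_f(v):
    return np.array([2.0 * v[0], 8.0 * v[1]])

v = np.array([3.0, -2.0])
eta = 0.1                      # learning rate (illustrative)
for _ in range(100):
    v = v - eta * grad_f(v)    # move against the gradient to decrease f
print(v, f(v))                 # v approaches the minimum at the origin
```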
Learning of the network
Backpropagation
How the network learns
Error backpropagation path
For $y = g(x)$ and $z = f(g(x)) = f(y)$, by the chain rule:
$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$
Learning of the network
Backpropagation
For $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$ and $g: \mathbb{R}^m \to \mathbb{R}^n$, $f: \mathbb{R}^n \to \mathbb{R}$:
$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$, i.e., $\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$
In vector form: $\nabla_x z = \big(\frac{\partial y}{\partial x}\big)^T \nabla_y z$, where $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$
From gradient descent,
How the network learns
$x' = x - \eta \big(\frac{\partial y}{\partial x}\big)^T \nabla_y z$, and for the parameters, $\theta' = \theta - \eta \big(\frac{\partial y}{\partial \theta}\big)^T \nabla_y z$
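A small sketch of the Jacobian form of the chain rule, verified against a numerical gradient; the choices of $g$ (a tanh layer) and $f$ (a squared norm) are illustrative assumptions:

```python
import numpy as np

# z = f(g(x)) with g: R^m -> R^n, f: R^n -> R, using
# grad_x z = (dy/dx)^T grad_y z.
rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.normal(size=(n, m))

def g(x):                      # y = tanh(W x), so dy/dx = diag(1 - y^2) W
    return np.tanh(W @ x)

def f(y):                      # z = 0.5 * ||y||^2, so grad_y z = y
    return 0.5 * np.dot(y, y)

x = rng.normal(size=m)
y = g(x)
grad_y = y                               # gradient of f at y
J = (1.0 - y ** 2)[:, None] * W          # n-by-m Jacobian dy/dx
grad_x = J.T @ grad_y                    # chain rule

# Check against a central-difference numerical gradient.
eps = 1e-6
num = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                for e in np.eye(m)])
print(np.max(np.abs(grad_x - num)))      # near zero: analytic matches numeric
```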
Learning of the network
Universal approximation theorem
Gradient descent & Backpropagation
Practical reasons for failure
Optimization
Optimizer (SGD, AdaGrad, RMSprop, Adam, etc.)
Weight initialization
Regularization
Parameter norm penalty ($L^2$, $L^1$)
Augmentation / noise injection (weight noise, label smoothing)
Multitask learning
Parameter sharing (CNN)
Ensemble / Dropout
Adversarial training
How the network learns
(Parameter sharing in a CNN acts as a domain-specific prior)
Convolutional neural network
Convolution vs cross-correlation
Modern deep learning
Convolution: $S(i,j) = (I * K)(i,j) = \sum_m \sum_n I(m,n)\, K(i-m,\, j-n) = (K * I)(i,j) = \sum_m \sum_n I(i-m,\, j-n)\, K(m,n)$
Cross-correlation: $S(i,j) = (I \star K)(i,j) = \sum_m \sum_n I(i+m,\, j+n)\, K(m,n)$
Most CNN implementations actually use cross-correlation, not convolution
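A hand-rolled sketch of both operations (the 5×5 image and 2×2 kernel are illustrative), showing that true convolution equals cross-correlation with a flipped kernel:

```python
import numpy as np

def cross_correlate2d(I, K):
    """S(i, j) = sum_{m,n} I(i+m, j+n) K(m, n)  (valid region only)."""
    kh, kw = K.shape
    H = I.shape[0] - kh + 1
    W = I.shape[1] - kw + 1
    return np.array([[np.sum(I[i:i + kh, j:j + kw] * K)
                      for j in range(W)] for i in range(H)])

def convolve2d(I, K):
    """True convolution: cross-correlation with the kernel flipped."""
    return cross_correlate2d(I, np.flip(K))

I = np.arange(25.0).reshape(5, 5)        # illustrative 5x5 "image"
K = np.array([[1.0, 0.0], [0.0, -1.0]])  # illustrative 2x2 kernel
print(cross_correlate2d(I, K))
print(convolve2d(I, K))                  # differs unless K is symmetric
```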
Convolutional neural network
Significant characteristics of CNN
 Sparse interaction
 Parameter sharing
 Equivariant representation
Sparse interaction
 Kernel size ≪ input size (e.g., a 128-by-128 image and a 3-by-3 kernel)
 For $m$ inputs and $n$ outputs:
fully connected network: $O(m \times n)$ connections
CNN: $O(k \times n)$, where $k$ is the number of connections per output
 In practice, $k$ is several orders of magnitude smaller than $m$
Modern deep learning
[Figure: connectivity and receptive fields, CNN vs. fully connected network]
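A back-of-the-envelope sketch of this count for an assumed 128×128 feature map and 3×3 kernel:

```python
# Illustrative connection counts for a 128x128 input feature map.
m = 128 * 128                  # input units
n = 128 * 128                  # output units (same spatial size, assumed)
k = 3 * 3                      # connections per output for a 3x3 kernel

print(f"fully connected: {m * n:,} connections")   # O(m x n), ~268 million
print(f"convolutional:   {k * n:,} connections")   # O(k x n), ~147 thousand
```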
Convolutional neural network
Parameter sharing
 Learn only one set of parameters (a kernel), shared across every location
 Reduces the required amount of memory
Modern deep learning
[Figure: vertical-edge detection, CNN vs. fully connected network — computation: about 4 billion times more efficient; memory: 178,640 entries needed for the matrix-multiplication version]
Convolutional neural network
Equivariant representation
(translation equivariance)
 A translation of the input → the same translation of the output
Modern deep learning
[Figure: the location of the output feature follows the location of the cat in the input]
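A small sketch of translation equivariance: shifting the input by one pixel shifts the feature map by the same amount (the single-pixel image and all-ones kernel are illustrative):

```python
import numpy as np

def xcorr2d(I, K):
    """Valid cross-correlation, as used by CNN layers."""
    kh, kw = K.shape
    return np.array([[np.sum(I[i:i + kh, j:j + kw] * K)
                      for j in range(I.shape[1] - kw + 1)]
                     for i in range(I.shape[0] - kh + 1)])

I = np.zeros((6, 6)); I[2, 2] = 1.0                 # one bright pixel
K = np.ones((2, 2))                                 # illustrative kernel
S1 = xcorr2d(I, K)
S2 = xcorr2d(np.roll(I, 1, axis=1), K)              # shift input right by 1
print(np.allclose(np.roll(S1, 1, axis=1), S2))      # True (away from borders)
```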
Convolutional neural network
Pooling (translation invariance)
Useful for tasks that care more about whether some feature
exists than about exactly where it is
Modern deep learning
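A minimal sketch of this idea with 1-D max pooling (the feature values are illustrative): the strong response survives pooling even after the input is shifted:

```python
import numpy as np

def max_pool1d(x, width=2):
    """Non-overlapping max pooling over a 1-D feature vector."""
    return x.reshape(-1, width).max(axis=1)

features = np.array([0.1, 0.9, 0.2, 0.1, 0.0, 0.3])
shifted = np.roll(features, 1)     # shift the input by one position

print(max_pool1d(features))        # [0.9 0.2 0.3]
print(max_pool1d(shifted))         # [0.3 0.9 0.1]: the 0.9 response survives
```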
Convolutional neural network
Prior beliefs imposed by convolution and pooling
The function the layer should learn contains only local
interactions and is equivariant to translation
The function the layer learns must be invariant to small
translations
Cf. the Inception module (Szegedy et al., 2015) and
the capsule network (Hinton et al., 2017)
Modern deep learning
Convolutional neural network
Historical significance of CNNs
Since AlexNet won the ImageNet challenge (2012)
Modern deep learning
Convolutional neural network
Historical significance of CNNs
The first deep network trained and operated
successfully with backpropagation
The reason for its success is not entirely clear
Computational efficiency may have made it feasible to
run more experiments to tune the implementation
and hyperparameters
CNNs achieved state of the art on data with a
clear grid-structured topology (such as images)
Modern deep learning
End
Q & A
Editor's Notes

1. A simple model to emulate a single neuron: a perceptron takes binary inputs (x₁, x₂, x₃, …) and produces a single binary output (0 or 1).
  2. By Cmglee - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=20206883
  3. Image source: https://www.cc.gatech.edu/~san37/post/dlhc-cnn/
  4. Image source: https://www.cc.gatech.edu/~san37/post/dlhc-cnn/
  5. Image source: https://www.cc.gatech.edu/~san37/post/dlhc-cnn/
  6. Image source: https://www.topbots.com/14-design-patterns-improve-convolutional-neural-network-cnn-architecture/
7. Computing the output of a 13-layer convolutional neural network requires about 30 billion operations.