Machine Learning and Neural Network
Course introduction
Seokhyun Yoon, Electronics Eng., Dankook University
Machine learning: Course introduction
 Target audience
 Senior undergraduate
 First year graduate student
 Prerequisite
 Linear algebra (or Engineering Mathematics 2)
 Basic probability and statistics
 Basic Python programming
 Textbook
 Introduction to Machine Learning and Artificial Neural Networks (기계학습과 인공신경망 개론), Ver 1.xx
 Download: https://www.slideshare.net/SeokhyunYoon1/
2019-09-26 2Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.1: Introduction
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.1 ML Introduction
 Objective: get a “feel” for the field and learn its terminology
1. What is machine learning? Concept and applications
2. What problems can ML solve?
 Classification, regression and clustering
 Supervised and unsupervised learning
3. Key elements of ML
 Data, Model and Cost
4. Design steps and issues in performance evaluation
2019-09-26 4Machine learning and artificial neural network
Machine learning: Introduction
 Major applications
 Pattern classification: Character/Speech recognition
 Object detection and tracking
 Time-series prediction (Stock price/market prediction,
weather forecast)
 Sentence completion and language translation
 … and much more
 Problems in machine learning
 Classification
 Regression
 Clustering
2019-09-26 5Machine learning and artificial neural network
Machine learning: Introduction
Related fields
2019-09-26 6Machine learning and artificial neural network
[Diagram: machine learning and neural networks at the intersection of related fields — probability and statistics, data science (big data, data mining), computer science, artificial intelligence, cognitive science (psychology, neuroscience), and linguistics]
Machine learning: Introduction
 Elements of machine learning in classification and regression problems
 Prediction model f(x; θ) with parameters θ
 Data (observations x and their target values y)
 Cost (loss)/objective function J(θ) to minimize/maximize
 Algorithm to efficiently obtain the optimal or a good
solution
2019-09-26 7Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = f(x; θ)
Cost/loss: J(θ)
Algorithm to solve θ* = argmin_θ J(θ)
Machine learning: Introduction
 Machine learning process
2019-09-26 8Machine learning and artificial neural network
Existing data {(x_i, y_i)}, i = 1, …, N → machine learning algorithm (solves θ* = argmin_θ J(θ)) → model f(x; θ*) → given new data x, prediction ŷ = f(x; θ*)
Machine learning: Introduction
 Classification and regression
2019-09-26 9Machine learning and artificial neural network
[Figure: classification example — an image labeled “cat (smiling)”; regression example — a scatter of (x, y) points with queries x = 2.5 and x = 6.0 to be estimated; existing data vs. new data]
Machine learning: Introduction
 Given an observation x (which can be a vector, a matrix (image) or a tensor)
 Classification determines its class among a set of classes
 Regression estimates/predicts an unobserved variable y
 Regression can be a prediction of a future trend or interpolation of some missing information
 Classification vs. regression
 In classification, y is a discrete, categorical value drawn from a finite set
 In regression, y is a numerical value
2019-09-26 10Machine learning and artificial neural network
Machine learning: Introduction
 Machine learning is all about finding f and θ
 How to find the best or, at least, a good f?
 Given f, how to find the best or, at least, a good θ?
 The best or a good θ for what and in what sense?
 Why do we need pre-collected data for learning/training ?
2019-09-26 11Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = f(x; θ)
Cost/loss: J(θ)
Algorithm to solve θ* = argmin_θ J(θ)
Machine learning: Introduction
 Some terminologies
 Learning/Training/Model fitting: the process of finding the model parameters (θ) that best fit the given data in terms of the predefined cost/objective
 Supervised learning: target values (y) are provided
• Classification, regression
 Unsupervised learning: no target values provided
• Clustering
2019-09-26 12Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = f(x; θ)
Cost/loss: J(θ)
Algorithm to solve θ* = argmin_θ J(θ)
Machine learning: Introduction
 Design steps (supervised learning)
1. Define the function you want to implement (define the input x and the output y)
2. Design your model f(x; θ), intuitively and smartly
3. Collect data and curate them to form the training set
4. Train the model to get θ*
5. Use f(x; θ*) to evaluate the performance
6. If satisfied, you are done! Otherwise, go to step 2 (skip 3).
 Step 2 requires strong/some mathematical background
 Step 3 is typically time-consuming and sometimes requires
domain expertise (e.g. for medical application)
2019-09-26 13Machine learning and artificial neural network
Machine learning: Introduction
 Design steps for beginner (supervised learning)
1. Choose a function you want to implement (input/output
formats are pre-defined)
2. Search for some open SW packages to choose/construct
an appropriate model and try to modify slightly
3. Download a dataset (X, y) from the internet
4. Use the packages to train the model to get θ*
5. Use θ* to evaluate the performance
6. If satisfied, you are done! Otherwise, go to step 2 (skip 3).
2019-09-26 14Machine learning and artificial neural network
Machine learning: Introduction
 Parameters and hyper parameters
 Most of the models have some hyper-parameters that
are pre-defined before training
 Must be optimized for performance, computing costs …
 may need grid search to find the best combination of
hyper parameters.
2019-09-26 15Machine learning and artificial neural network
Machine learning: Introduction
 Performance evaluation of classifier/regressor
 Must consider “generalization error”
 Typical performance measures
 Classification: Accuracy
 Regression: Mean squared error (MSE), R² measure
2019-09-26 16Machine learning and artificial neural network
Figure 1.1: Training and testing of a classifier/estimator
Machine learning: Introduction
 Clustering
 No target values for observations
 Objective is to divide data into a set of groups based on
some similarity measures
 Need to devise procedures to efficiently group data
 Data (distribution) visualization may help
 Once clustered, the data can be used for classification
2019-09-26 17Machine learning and artificial neural network
Machine learning: Introduction
 Two typical similarity measures
 Euclidean distance: d(x, x′) = ‖x − x′‖₂
 Correlation: ρ(x, x′) = xᵀx′ / (‖x‖ ‖x′‖)
 Need to consider symmetry and their ranges
 Note
 L-p norm of a vector: ‖x‖_p = (Σ_m |x_m|^p)^(1/p)
 Default value of p = 2
 Schwartz's inequality: |xᵀx′| ≤ ‖x‖ ‖x′‖
2019-09-26 18Machine learning and artificial neural network
Machine learning: Introduction
 Simplest classifier: k nearest neighbor (knn) classifier
 Training data {(x_i, y_i)}, i = 1, …, N, are used as templates
 Given new input data x, it determines its class as follows
1. Compute d(x, x_i) for all i (may use another similarity measure)
2. Select the k candidates nearest to x
3. Use a majority vote to determine the class of x
2019-09-26 19Machine learning and artificial neural network
Existing data {(x_i, y_i)}, i = 1, …, N → kNN classifier ← new data x → prediction ŷ
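As a concrete illustration of these three steps, here is a minimal NumPy sketch of the k-NN decision rule; the toy data, the choice k = 3, and the use of Euclidean distance are illustrative assumptions, not part of the slides.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Step 1: Euclidean distance from x_new to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k nearest candidates
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example (assumed data, two classes in 2-D)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```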
Machine learning: Introduction
 k nearest neighbor (knn) as regressor
 Training data {(x_i, y_i)}, i = 1, …, N, are used as templates
 Given new input data x, it estimates its target value as follows
1. Compute d(x, x_i) for all i (may use another similarity measure)
2. Select the k candidates nearest to x
3. Take the average over the k candidates' target values to determine the estimate
2019-09-26 20Machine learning and artificial neural network
Existing data {(x_i, y_i)}, i = 1, …, N → kNN regressor ← new data x → prediction ŷ
Machine Learning and Neural Network
Ch.2: Data and descriptive statistics
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.2 Data and descriptive statistics
 Topics
1. Data: types and representation
2. Descriptive statistics
 Scatter plot and histogram
 Mean, correlation and covariance
2019-09-26 22Machine learning and artificial neural network
Data and descriptive stat.
 Terminologies and notation
 Observation/sample/feature vector x (for now, assume that it is a vector)
 Target value y: the desired value for a sample
 In supervised learning, x and y should be paired (x_i, y_i)
 Collection of data: X = [x₁, x₂, …, x_N]
2019-09-26 23Machine learning and artificial neural network
Each column is a sample
each row is a feature
Data and descriptive stat.
 Two types of data:
 Categorical
 Numerical
 Categorical value is typically mapped to an integer
to make it suitable for computation
 ex: T → 1, F → 0
 Blood type: O → 0, A → 1, B → 2, AB → 3
2019-09-26 24Machine learning and artificial neural network
Data and descriptive stat.
 An example of multivariate data
 A dataset consisting of 20 samples
 Each column is one sample with 4 features (Group, English, Math, Science score) → call it a feature vector
 where Group is categorical and the others are numerical
2019-09-26 25Machine learning and artificial neural network
sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD.
Group A A A A A A A A A A B B B B B B B B B B
English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09
Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09
Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
a sample/observation
Data and descriptive stat.
 Example problems
 Classification:
Given x, determine y
 Regression:
Given x, estimate y
2019-09-26 26Machine learning and artificial neural network
sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD.
Group A A A A A A A A A A B B B B B B B B B B
English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09
Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09
Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
a sample/observation
Data and descriptive stat.
 Data visualization: scatter plot and histogram
2019-09-26 27Machine learning and artificial neural network
 Empirical (Probability)
density gives us lots of
information for the
design and performance
of classifier, regressor
and clustering algorithm
 One or two dimensional
(bivariate) data is easy
to visualize
 While, more than 2D is
hard
 Pairwise scatter plot is
affordable for small M
Data and descriptive stat.
 Problems in machine learning
2019-09-26 28Machine learning and artificial neural network
[Figure: example datasets for classification, regression, and clustering]
Few probability-distribution models can be successfully applied to practical datasets.
That is why we resort to machine learning based on a collection of samples.
Data and descriptive stat.
 Mean, Correlation and Covariance
 Consider a dataset X of N samples and M features
 (Per-feature) mean: m_m = (1/N) Σ_i x_{m,i}
 (Per-feature) variance: σ_m² = (1/N) Σ_i (x_{m,i} − m_m)²
 σ_m is the standard deviation
 The m_m's and σ_m²'s can be collectively represented as vectors
2019-09-26 29Machine learning and artificial neural network
Data and descriptive stat.
 Mean, Correlation and Covariance
 Dataset X of N samples and M features
 Correlation (for a pair of features): r_{mn} = (1/N) Σ_i x_{m,i} x_{n,i}
 Covariance (for a pair of features): c_{mn} = r_{mn} − m_m m_n
 r_{mn} = r_{nm}, c_{mn} = c_{nm} (symmetric)
 The r_{mn}'s and c_{mn}'s can be collectively represented as matrices
2019-09-26 30Machine learning and artificial neural network
Data and descriptive stat.
 Mean, Correlation matrix and Covariance matrix
 Consider a dataset X of N samples and M features (each column of X is a sample)
 Mean (vector): m_X = (1/N) Σ_i x_i
 Correlation matrix: R_XX = (1/N) X Xᵀ
 Covariance matrix: C_XX = R_XX − m_X m_Xᵀ
 Cross correlation: r_Xy = (1/N) X y
 Cross covariance: c_Xy = r_Xy − m_X m_y
2019-09-26 31Machine learning and artificial neural network
(Sizes: R_XX and C_XX are M×M; r_Xy and c_Xy are M×1)
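The matrix definitions above translate directly into NumPy. Below is a minimal sketch assuming, as on the earlier slide, that each column of X is a sample and each row a feature; the toy numbers are made up for illustration.

```python
import numpy as np

# Assumed toy data: M = 3 features (rows), N = 5 samples (columns)
X = np.array([[77., 81., 74., 89., 78.],
              [72., 67., 74., 64., 71.],
              [75., 68., 72., 68., 74.]])
N = X.shape[1]

m_X = X.mean(axis=1, keepdims=True)   # mean vector, M x 1
R_XX = (X @ X.T) / N                  # correlation matrix, M x M
C_XX = R_XX - m_X @ m_X.T             # covariance matrix, M x M

# C_XX is symmetric and non-negative definite: its eigenvalues are >= 0
eigvals = np.linalg.eigvalsh(C_XX)
print(m_X.ravel(), eigvals)
```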
Data and descriptive stat.
 Properties of R_XX (and C_XX)
 R_XXᵀ = R_XX (symmetric)
 R_XX is non-negative definite, such that, for any vector v, vᵀ R_XX v ≥ 0
 The eigenvalues are all non-negative and their eigenvectors form an orthonormal basis, i.e., with the eigen-decomposition R_XX = E Λ Eᵀ, the diagonal elements of Λ are all non-negative real and Eᵀ E = I
 det(R_XX) equals the product of the eigenvalues (hence ≥ 0)
 If N < M (the number of samples is less than the number of features), then R_XX has at most N non-zero eigenvalues (all others are zero). In this case, R_XX is not invertible
 These properties also hold for C_XX
2019-09-26 32Machine learning and artificial neural network
Data and descriptive stat.
 For the two given data matrices X₁ and X₂,
 Find m_X
 Find R_XX and C_XX
 Check whether R_XX and C_XX satisfy the properties in the previous slide.
2019-09-26 33Machine learning and artificial neural network
Data and descriptive stat.
 Example (Problem 2.2)
 Find the correlation and covariance between
• English and Math
• English and Science
• Math and Science
 Find R_XX and C_XX
 Check whether R_XX and C_XX satisfy the properties in the previous slide.
2019-09-26 34Machine learning and artificial neural network
sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD.
English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09
Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09
Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
Homework & Computer Lab.
 Homework: 2.1, 2.2
2019-09-26 35Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.3: Multi-variate Gaussian PDF
and linear transform
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.3 Multivariate Gaussian PDF & linear transform
 Topics
1. Multi-variate Gaussian PDF
 Pearson’s correlation coefficient
2. Linear transformation
 Principal axes transform and whitening
3. Principal component analysis (PCA)
2019-09-26 37Machine learning and artificial neural network
Multivariate Gaussian PDF: definition
 Definition of the multivariate Gaussian (Normal) PDF
 Consider a Gaussian random vector x = [x₁, x₂, …, x_M]ᵀ
 The PDF of x is defined, in general, as
  p(x) = (2π)^(−M/2) |C|^(−1/2) exp(−½ (x − μ)ᵀ C⁻¹ (x − μ))
 where μ is the mean and C is the covariance matrix
 Note
 The quadratic form (x − μ)ᵀ C⁻¹ (x − μ) is a scalar (C⁻¹ is M×M)
 Mahalanobis distance: d(x, μ) = ((x − μ)ᵀ C⁻¹ (x − μ))^(1/2) (symmetric)
2019-09-26 38Machine learning and artificial neural network
Multivariate Gaussian PDF
 3 cases of the bivariate Gaussian (Normal) PDF
 Case 1: μ = [0, 5]ᵀ, C = [[9, 0], [0, 9]]
 Case 2: μ = [0, 5]ᵀ, C = [[1, 0], [0, 16]]
 Case 3: μ = [0, 5]ᵀ, C = [[9, −10], [−10, 16]]
2019-09-26 39Machine learning and artificial neural network
Mean is just a
“translation”
Contour plot
Multivariate Gaussian PDF
 Let's take a closer look
 A “contour” can be obtained from (x − μ)ᵀ C⁻¹ (x − μ) = c (a constant)
 Suppose that μ = 0 for simplicity → xᵀ C⁻¹ x = c
 Suppose also that M = 2 (bivariate), with C = [[σ₁², c₁₂], [c₁₂, σ₂²]]
 Then, we have
  (z₁/σ₁)² − 2ρ (z₁/σ₁)(z₂/σ₂) + (z₂/σ₂)² = c′(1 − ρ²)
 where ρ is the Pearson correlation coefficient defined as ρ = c₁₂/(σ₁σ₂), satisfying −1 ≤ ρ ≤ 1
 We say
 z₁ and z₂ are uncorrelated if ρ = 0
 and have perfect correlation if ρ = ±1
2019-09-26 40Machine learning and artificial neural network
This is an ellipse
Multivariate Gaussian PDF
 Examples
 The Pearson correlation coefficient between two random variables (two features) x_m and x_n is defined as ρ_{mn} = c_{mn}/(σ_m σ_n), satisfying −1 ≤ ρ_{mn} ≤ 1
 We say that
 x_m and x_n are uncorrelated if ρ_{mn} = 0
 and have perfect correlation if ρ_{mn} = ±1
2019-09-26 41Machine learning and artificial neural network
Data and descriptive stat.
 What can you see?
2019-09-26 42Machine learning and artificial neural network
 Are Math and English
scores correlated ?
 What can you say
about Math and English
score? Set up your
hypothesis.
 Use the figure in the
previous page to
roughly estimate the
Pearson correlation
coefficient.
Multivariate Gaussian PDF (supplementary)
 Marginalization of an M-variate Gaussian PDF is also a Gaussian PDF with (M−1) variates
  p(x₁, …, x_{i−1}, x_{i+1}, …, x_M) = ∫ p(x) dx_i
 Successive marginalization gives us a univariate Gaussian PDF
  p(x_m) = N(x_m; μ_m, σ_m²)
2019-09-26 43Machine learning and artificial neural network
Linear transform
 Definition of a linear transformation
 For any matrix A of size (K×M), the linear transform of a vector x of size (M×1) is defined as y = A x
 A linear transform is a projection of x onto the row space of A
 Linear transform of a Gaussian random vector
 Suppose that x is a Gaussian RV with mean μ and cov. C, i.e., x ~ N(μ, C)
 Then, for any matrix A, the linear transform y = A x is also Gaussian with mean Aμ and covariance A C Aᵀ, i.e., y ~ N(Aμ, A C Aᵀ)
 Try to verify this using the definitions of mean and covariance in Ch.2
2019-09-26 44Machine learning and artificial neural network
Linear transform
 Principal axes transformation and whitening
 Suppose that C = E Λ Eᵀ (eigen-decomposition of C), where
  Λ: diagonal matrix with λ_m (the m-th eigenvalue)
  E: eigen basis (the m-th column is the eigenvector for λ_m)
 (Principal axes transform) The linear transform y = Eᵀ(x − μ), using Eᵀ as the transform matrix, is Gaussian with PDF N(y; 0, Λ)
 (Whitening) By using Λ^(−1/2) Eᵀ as the transform matrix, y = Λ^(−1/2) Eᵀ (x − μ) is also Gaussian with PDF N(y; 0, I)
2019-09-26 45Machine learning and artificial neural network
Principal Component Analysis (PCA)
 Principal component analysis (PCA)
 With C = E Λ Eᵀ,
 PCA uses several (typically two) eigenvectors corresponding to the largest eigenvalues as the projection matrix.
 Let
• (λ₁, λ₂) be the two largest eigenvalues
• (e₁, e₂) be the corresponding eigenvectors
 We use E₂ = [e₁ e₂] as the transform matrix: z = E₂ᵀ(x − μ)
 The distribution of z can be easily visualized in a low-dimensional (e.g., 2-D) space.
 If (λ₁ + λ₂)/tr(C) ≈ 1, z contains most of the information on x
2019-09-26 46Machine learning and artificial neural network
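A minimal NumPy sketch of the PCA projection described above (eigen-decomposition of the covariance matrix, then projection onto the two leading eigenvectors); the random toy data and the helper name pca_2d are assumptions for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project M-dimensional samples (columns of X) onto the two principal axes."""
    N = X.shape[1]
    m = X.mean(axis=1, keepdims=True)
    C = (X - m) @ (X - m).T / N            # covariance matrix, M x M
    eigvals, E = np.linalg.eigh(C)          # eigh: ascending eigenvalues, orthonormal E
    idx = np.argsort(eigvals)[::-1][:2]     # indices of the two largest eigenvalues
    E2 = E[:, idx]                          # M x 2 projection matrix
    return E2.T @ (X - m)                   # 2 x N low-dimensional representation

# Assumed toy data: 4 features, 6 samples
X = np.random.default_rng(0).normal(size=(4, 6))
Z = pca_2d(X)
print(Z.shape)   # (2, 6)
```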
Data (distribution) visualization
 Pairwise scatter plot is NOT affordable for large M
2019-09-26 47Machine learning and artificial neural network
M = 4 M = 64 (showing only 10 features)
Data (distribution) visualization
2019-09-26 48Machine learning and artificial neural network
Pair-wise scatter plots of Iris dataset
(3 classes, 4 dimensional feature)
A 2-dimensional projection provides a better representation of the clusters and of the similarity between features
Data (distribution) visualization
2019-09-26 49Machine learning and artificial neural network
Pair-wise scatter plots of Digits dataset
(10 classes, 64 dimensional feature)
Showing only first 10x10
A 2-dimensional projection provides a better representation of the clusters and of the similarity between features
Homework & Computer Lab.
 Homework: 3.1~3.6
 Practice:
ML_practice0_ch3_data_visualization_190817c.ipynb
2019-09-26 50Machine learning and artificial neural network
Machine Learning and Neural Network
Appendix A: Optimization I
Seokhyun Yoon, Electronics Eng., Dankook University
Appendix: Optimization
 Topics
1. Optimization I: Unconstrained optimization
 Definition of optimization problem
 Quadratic programming problem
 Maximum likelihood estimation as an optimization problem
2. Optimization II: Iterative solutions
 Gradient descent and stochastic gradient descent
 Coordinate descent
 Newton-Raphson method
3. Optimization III: Constrained optimization
 Definition
 Lagrange multiplier and Rayleigh quotient optimization
 Duality in constrained optimization and KKT condition
2019-09-26 52Machine learning and artificial neural network
Unconstrained optimization
 Definitions of unconstrained optimization
 Minimization: min_{θ∈ℝᴹ} J(θ), or θ* = argmin_θ J(θ)
 Maximization: max_{θ∈ℝᴹ} J(θ), or θ* = argmax_θ J(θ)
where J(θ) is a cost/objective function.
 Convex optimization
 If J(θ) is a convex function, the solution can be obtained by solving ∇_θ J(θ) = 0 (as there is only one minimum (maximum))
where ∇_θ is the gradient operator
2019-09-26 53Machine learning and artificial neural network
Unconstrained optimization: QP problem
 Quadratic programming (QP) problem
 A QP problem is a special case of a convex optimization problem
 J(θ) is a quadratic function of θ, e.g.,
  J(θ) = ½ θᵀAθ − bᵀθ + c (with A non-negative definite)
 Since J(θ) is a convex function, the solution is given by solving ∇_θ J(θ) = Aθ − b = 0
 Solution: θ* = A⁻¹b (if A is invertible)
2019-09-26 54Machine learning and artificial neural network
Unconstrained optimization: Gradient formula
 Gradient operators
 For a vector θ: ∇_θ = [∂/∂θ₁, ∂/∂θ₂, …, ∂/∂θ_M]ᵀ
 For a matrix A: ∇_A = [∂/∂a_{ij}] (same size as A)
 Gradient formulas
 ∇_θ (bᵀθ) = ∇_θ (θᵀb) = b
 ∇_θ (θᵀAθ) = (A + Aᵀ)θ
 ∇_A (bᵀAc) = b cᵀ
 ∇_A log|A| = (A⁻¹)ᵀ
2019-09-26 55Machine learning and artificial neural network
Unconstrained optimization: Gradient formula
 Example (Problem A.1):
 Minimize J(θ₁, θ₂), i.e., find (θ₁*, θ₂*) that minimizes J, and find also the minimum value J(θ₁*, θ₂*)
 Express J in vector-matrix form, i.e., J(θ) = ½θᵀAθ − bᵀθ + c
 Use the vector-matrix form to minimize J (use the gradient formulas)
 Repeat for the second cost function given in the problem
2019-09-26 56Machine learning and artificial neural network
Maximum likelihood estimation
 Given
 Data samples: {x₁, x₂, …, x_N}
 PDF model: p(x; θ) with unknown parameter θ
 We want to find θ that maximizes
 the likelihood of the data: L(θ) = Π_i p(x_i; θ)
 or the log-likelihood: log L(θ) = Σ_i log p(x_i; θ)
 It is a maximization problem
  θ* = argmax_{θ∈ℝᴹ} L(θ) = argmax_{θ∈ℝᴹ} log L(θ)
2019-09-26 57Machine learning and artificial neural network
MLE example: Bernoulli trial
 Given
 Data samples: {x₁, …, x_N}, where x_i ∈ {0, 1}
 PDF model: p(x; θ) = θˣ(1 − θ)^(1−x) with 0 ≤ θ ≤ 1
 Parameter to estimate: θ
 Likelihood function: L(θ) = Π_i θ^{x_i}(1 − θ)^{1−x_i} = θᵏ(1 − θ)^{N−k}
 Solution: θ* = k/N
2019-09-26 58Machine learning and artificial neural network
Try to verify this by maximizing
the likelihood or log-likelihood function,
where k is the number of 1's
that occurred in the N trials
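A quick numerical check of the θ* = k/N result, evaluating the log-likelihood on a grid; the sample values below are assumed for illustration.

```python
import numpy as np

# Assumed Bernoulli samples; the MLE should equal k/N, the fraction of 1's
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
k, N = x.sum(), len(x)

# Log-likelihood: log L(theta) = k*log(theta) + (N - k)*log(1 - theta)
thetas = np.linspace(0.01, 0.99, 999)
loglik = k * np.log(thetas) + (N - k) * np.log(1 - thetas)
print(thetas[np.argmax(loglik)], k / N)   # both ~ 0.7
```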
MLE example: Multi-variate Gaussian PDF (optional)
 Given
 Data samples: {x₁, …, x_N}, where x_i ∈ ℝᴹ
 PDF model: p(x; μ, C) = (2π)^(−M/2)|C|^(−1/2) exp(−½(x − μ)ᵀC⁻¹(x − μ))
 where μ: mean, C: covariance matrix → the parameters to estimate
 Log-likelihood function
  log L(μ, C) = −(N/2) log|C| − ½ Σ_i (x_i − μ)ᵀC⁻¹(x_i − μ) + const.
 Solution:
  μ* = (1/N) Σ_{i=1}^{N} x_i
  C* = (1/N) Σ_{i=1}^{N} (x_i − μ*)(x_i − μ*)ᵀ
2019-09-26 59Machine learning and artificial neural network
Try to verify this using
gradient formula.
Seokhyun Yoon, Electronics Eng., Dankook University
Machine Learning and Neural Network
Ch.4: Regression
Roadmap
2019-09-26 61Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
Ch.4 Regression
 Topics
1. Linear regression
2. Vector-matrix representation of linear regression
3. Linear prediction
4. Non-linear regression and overfit
5. Performance evaluation: cross-validation
2019-09-26 62Machine learning and artificial neural network
Regression
 Elements of regression problem
 Prediction model f(x; θ) with parameters θ
 Data (observations x and their target values y)
 Cost (loss)/objective function J(θ) to minimize/maximize
 Algorithm to efficiently obtain the optimal or a good solution
2019-09-26 63Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = f(x; θ)
Cost/loss: J(θ)
Algorithm to solve θ* = argmin_θ J(θ)
Regression: Linear regression
 A simple example of linear regression
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i, y_i ∈ ℝ
 Model: ŷ = θ₁x + θ₀, where the parameters are θ = (θ₀, θ₁)
 The problem is to find the best θ for the given data
 Best in what sense?
[Figure: scatter plot of the points (x_i, y_i) in the (x, y) plane with a fitted line]
Regression: Linear regression
 Least squares solution (least squares method)
 We want to minimize the residual sum of squares (RSS)
 Define the error: e_i = y_i − (θ₁x_i + θ₀)
 Minimize: J(θ) = Σ_i e_i² = Σ_i (y_i − θ₁x_i − θ₀)²
 where J(θ) is a quadratic (convex) function of θ₀ and θ₁
 Can use ∇_θ J(θ) = 0 to find θ₀ and θ₁ in terms of the data
2019-09-26 65Machine learning and artificial neural network
Regression: Linear regression
 Generalization to multi-variate data
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ ℝ
 Model: ŷ = θ₁x₁ + θ₂x₂ + … + θ_M x_M + θ₀
 where the parameters are θ = (θ₀, θ₁, …, θ_M)
 Cost function: residual sum of squares (RSS)
 where e_i = y_i − ŷ_i
  J(θ) = Σ_i e_i²
 The problem is to find θ* = argmin_{θ∈ℝ^{M+1}} J(θ)
2019-09-26 66Machine learning and artificial neural network
Regression: Model structure
 Model and its training at a glance
2019-09-26 67Machine learning and artificial neural network
Regression: Linear regression
 Solution
 J(θ) is a quadratic function of the θ_m's (a convex function)
 Can use ∇_θ J(θ) = 0 to obtain a system of linear equations
 Then, solve the system of equations to get θ*
 Equivalently, in vector-matrix form, (X̃ X̃ᵀ) θ = X̃ y
 where θ = [θ₀, θ₁, …, θ_M]ᵀ, X̃ X̃ᵀ = Σ_i x̃_i x̃_iᵀ, and X̃ y = Σ_i y_i x̃_i (x̃_i = [1, x_iᵀ]ᵀ)
2019-09-26 68Machine learning and artificial neural network
Regression: Vector matrix notation
 Vector-matrix notation
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ ℝ
 Model: ŷ = θᵀx̃
 where x̃ = [1, x₁, …, x_M]ᵀ, θ = [θ₀, θ₁, …, θ_M]ᵀ
 Cost function: residual sum of squares (RSS)
 Error vector: e = y − X̃ᵀθ
  J(θ) = ‖e‖² = (y − X̃ᵀθ)ᵀ(y − X̃ᵀθ)
  ∇_θ J(θ) = −2 X̃ y + 2 X̃ X̃ᵀ θ
where X̃ = [x̃₁, x̃₂, …, x̃_N] is the (M+1)×N data matrix whose first row is all ones (each column is a 1-augmented sample)
Regression: Vector matrix notation
 Vector-matrix notation
 The problem is to find the solution of ∇_θ J(θ) = 0, which is
  ∇_θ J(θ) = −2 X̃ y + 2 X̃ X̃ᵀ θ = 0 → (X̃ X̃ᵀ) θ = X̃ y
 Solution: θ* = (X̃ X̃ᵀ)⁻¹ X̃ y
 A unique solution exists only if X̃ X̃ᵀ is invertible!
2019-09-26 70Machine learning and artificial neural network
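A minimal NumPy sketch of the closed-form solution θ* = (X̃ X̃ᵀ)⁻¹ X̃ y with the column-sample convention used above; the toy data are assumed for illustration.

```python
import numpy as np

# Assumed toy data: M = 2 features, N = 6 samples (columns)
X = np.array([[1., 2., 3., 4., 5., 6.],
              [0., 1., 0., 1., 0., 1.]])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.8])

Xt = np.vstack([np.ones(X.shape[1]), X])     # 1-augmented data matrix, (M+1) x N
# Solve the normal equations (X~ X~^T) theta = X~ y
theta = np.linalg.solve(Xt @ Xt.T, Xt @ y)
y_hat = theta @ Xt                           # predictions for the training samples
print(theta)
```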
Regression: Linear regression example
 Example
 We want to estimate the English score using two models
  English = θ₁ · Math + θ₀
  English = θ₁ · Math + θ₂ · Science + θ₀
 Find (θ₀, θ₁) and (θ₀, θ₁, θ₂), respectively. (You may use the results of Problem 2.2)
 Homework: finish Problems 4.1 and 4.2
2019-09-26 71Machine learning and artificial neural network
sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD.
English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09
Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09
Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
Regression: Linear prediction
 Linear prediction
 Given time-series data x(0), x(1), …, x(T−1)
 Use the p previous samples to predict the next sample, i.e., we want to predict x(t) using x(t−1), …, x(t−p)
 Model: x̂(t) = θ₁x(t−1) + θ₂x(t−2) + … + θ_p x(t−p)
 Example 4.3
2019-09-26 72Machine learning and artificial neural network
𝑡 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
𝑥 -4 -3 14 8 1 -5 -7 -4 -2 6 10 22 15 -15 -20 ?
Regression: Linear prediction
 Linear prediction
 Target value: y = [x(p), x(p+1), …, x(T−1)]ᵀ
 Data matrix: X = [x_p, x_{p+1}, …, x_{T−1}], where x_t = [x(t−1), x(t−2), …, x(t−p)]ᵀ
 Model: x̂(t) = Σ_{k=1}^{p} θ_k x(t−k) (no intercept)
 Solution: θ* = argmin_θ J(θ) = R_XX⁻¹ r_Xy
 Prediction: x̂(T) = Σ_{k=1}^{p} θ_k* x(T−k)
 Note: R_XX is a Toeplitz matrix
2019-09-26 73Machine learning and artificial neural network
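A minimal sketch of this linear predictor using the time series from the table; the prediction order p = 2 is an illustrative assumption (the homework asks for other orders as well), and for convenience the rows of X here are samples.

```python
import numpy as np

# Time series from the slide; the last value (t = 15) is to be predicted
x = np.array([-4, -3, 14, 8, 1, -5, -7, -4, -2, 6, 10, 22, 15, -15, -20], dtype=float)
p = 2                                  # prediction order (assumed for illustration)

# Build target vector y and data matrix X from the p previous samples (no intercept)
y = x[p:]                                                              # x(p), ..., x(T-1)
X = np.column_stack([x[p - k: len(x) - k] for k in range(1, p + 1)])   # columns: x(t-1), ..., x(t-p)

theta = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares predictor coefficients
x_next = x[-1:-p - 1:-1] @ theta           # predict x(15) from x(14), x(13)
print(theta, x_next)
```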
Regression: Linear prediction
 Homework: Example 4.3
1) For the given prediction order p, write y and X, and find R_XX and r_Xy.
2) Find the linear predictor parameters θ* and use them to predict x(15).
3) Find the mean squared error (1/N) Σ_t (x(t) − x̂(t))². (N = 14)
4) Repeat (1)–(3) for the other prediction order(s).
5) Find the variance of the time-series data and compare it with the mean squared error for each prediction order.
6) Briefly compare and discuss the results of (5).
2019-09-26 74Machine learning and artificial neural network
𝑡 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
𝑥 -4 -3 14 8 1 -5 -7 -4 -2 6 10 22 15 -15 -20 ?
Regression: Non-linear model and overfit
 Example of Non-linear regression
 Two-feature data x = (x₁, x₂)
 Non-linear model: ŷ = f(x; θ), e.g., a polynomial in x₁ and x₂ such as ŷ = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂
 Defining the augmented feature vector x̃ = [1, x₁, x₂, x₁², x₂², x₁x₂]ᵀ, the RSS cost gives us θ* = (X̃ X̃ᵀ)⁻¹ X̃ y
 Note
 The model is non-linear in the x's, but linear in the θ's
 The RSS cost function still gives us a linear system of equations
2019-09-26 75Machine learning and artificial neural network
Regression: Non-linear model and overfit
 Considerations for non-linear regression
 If the model is a non-linear function of the θ's, the problem (finding the solution) becomes complicated.
 A non-linear model is subject to overfit (large generalization error), especially when the number of samples is relatively small compared to the number of parameters in the model.
 We need to check whether the model is overfitted to the data or not.
2019-09-26 76Machine learning and artificial neural network
Source: https://slideplayer.com/slide/6825533/
Regression: Non-linear model and overfit
 Overfit, underfit, and appropriate (just-right) fit
2019-09-26 77Machine learning and artificial neural network
source: https://slideplayer.com/slide/6825533/
source : https://towardsdatascience.com/underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6fe4a8a49dbf
Regression: Non-linear model and overfit
 How to check if the model is overfitted or not
 If the model is overfitted, the generalization error is much (?) larger than the minimized cost for the training data, i.e.,
  J_test(θ*) ≫ J_train(θ*)
 where θ* was obtained based on the training data
 That's why we divide the data (samples) into training and test sets for performance evaluation
 A more systematic approach to test overfit: cross validation
2019-09-26 78Machine learning and artificial neural network
Regression: Non-linear model and overfit
 L-fold cross-validation
1. Divide the entire data (of N samples) into L groups (of N/L samples per group)
2. Select one group for test and use all the others for training
3. Measure J_train(θ*) and J_test(θ*)
4. Repeat 2 and 3 for each group and take the average of both measures
5. Check whether J_test(θ*) ≫ J_train(θ*)
2019-09-26 79Machine learning and artificial neural network
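A minimal sketch of L-fold cross-validation following steps 1–5; the linear model, the fold count L = 5, and the toy data are illustrative assumptions.

```python
import numpy as np

def l_fold_cv(X, y, fit, predict, L=5):
    """L-fold cross-validation: average training and test MSE over the L splits."""
    N = len(y)
    idx = np.random.permutation(N)
    folds = np.array_split(idx, L)
    mse_train, mse_test = [], []
    for k in range(L):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(L) if j != k])
        theta = fit(X[train], y[train])
        mse_train.append(np.mean((y[train] - predict(X[train], theta)) ** 2))
        mse_test.append(np.mean((y[test] - predict(X[test], theta)) ** 2))
    return np.mean(mse_train), np.mean(mse_test)

# Example with a linear model (rows of X are samples here, for convenience)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda X, theta: X @ theta
X = np.random.default_rng(1).normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.default_rng(2).normal(size=40)
print(l_fold_cv(X, y, fit, predict, L=5))   # test MSE close to train MSE -> no overfit
```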
Homework & Computer Lab.
 Homework: 4.1, 4.2, 4.3
 Computer Lab: ML_practice1_regression_ex_190820.ipynb
2019-09-26 80Machine learning and artificial neural network
Seokhyun Yoon, Electronics Eng., Dankook University
Machine Learning and Neural Network
Ch.5: Regularization
Roadmap
2019-09-26 82Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
Ch.5 Regularization
 Topics
1. Ridge regression
2. LASSO regression
3. Elastic-net
2019-09-26 83Machine learning and artificial neural network
Regularization
 Recall linear regression
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ ℝ
 Model: ŷ = θᵀx̃
 where x̃ = [1, x₁, …, x_M]ᵀ, θ = [θ₀, θ₁, …, θ_M]ᵀ
 Cost function: residual sum of squares (RSS)
  J(θ) = ‖e‖², where e = y − X̃ᵀθ
  ∇_θ J(θ) = −2 X̃ y + 2 X̃ X̃ᵀ θ
 The problem is to find the solution of ∇_θ J(θ) = 0, which is
  (X̃ X̃ᵀ) θ = X̃ y → θ* = (X̃ X̃ᵀ)⁻¹ X̃ y
 A unique solution exists if X̃ X̃ᵀ is invertible! What if it is NOT?
2019-09-26 84Machine learning and artificial neural network
Regularization
 In what cases is X̃ X̃ᵀ NOT invertible?
 It is not if N < M, i.e., when the number of samples is less than the number of features (e.g., as in bioinformatics or medical applications)
 An infinite number of solutions exists
 The model parameters and performance can be highly variable with small changes in the data (overfit)
 Two possible approaches
 Increasing sample size (noise injection)
 Reducing feature dimension (selecting good features)
2019-09-26 85Machine learning and artificial neural network
Regularization
 Increasing sample size (noise injection)
 One can double the number of samples by generating a new set of data X′ = X + Z, where Z is a random noise matrix with covariance σ²I, i.e., z_i ~ N(0, σ²I)
 Then, use [X, X′] as the new data
 Note that [X̃, X̃′][X̃, X̃′]ᵀ = X̃ X̃ᵀ + X̃′ X̃′ᵀ, which is now invertible “anyway” if 2N > M
 It is effectively a “noise injection”
 → the generalization error can be reduced to some extent
 If needed, one can add more copies with different random noise.
 The noise variance must be chosen carefully.
 Note: the distribution of the noise-augmented data may not model well the true distribution of x.
2019-09-26 86Machine learning and artificial neural network
Regularization
 Reducing feature dimension (selecting features)
 One can select M’ (<N) features, for example, having
highest covariance with target value y.
 However, this does not guarantee a better performance.
 An efficient feature selection method (LASSO) will be
discussed shortly
2019-09-26 87Machine learning and artificial neural network
Regularization: Ridge and LASSO
 Ridge and LASSO regression: RSS + L1/L2 Penalty
 Ridge: J(θ) = ‖y − X̃ᵀθ‖² + λ‖θ‖₂²
 LASSO: J(θ) = ‖y − X̃ᵀθ‖² + λ‖θ‖₁
 Lp-norm: ‖θ‖_p = (Σ_m |θ_m|^p)^(1/p)
 λ controls the relative weight between the RSS and the penalty
 Elastic net: RSS + L1 + L2 penalty
 J(θ) = ‖y − X̃ᵀθ‖² + λ₁‖θ‖₁ + λ₂‖θ‖₂²
2019-09-26 88Machine learning and artificial neural network
Regularization: What is the impact of penalty?
 Ridge regression
 Ridge regression is simply a QP problem
 And the solution is θ* = (X̃ X̃ᵀ + λI)⁻¹ X̃ y
 X̃ X̃ᵀ + λI is invertible for λ > 0, even if X̃ X̃ᵀ is not (see Problem 6.3)
 It is effectively a “noise injection” (an increase of sample size)
 and the generalization error can be reduced to some extent
2019-09-26 89Machine learning and artificial neural network
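A minimal NumPy sketch of the closed-form ridge solution; for convenience the rows of X here are samples, and the data and λ are assumed for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution theta* = (X^T X + lam*I)^(-1) X^T y (rows = samples)."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# Works even when N < M, where plain least squares has no unique solution
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))     # N = 10 samples, M = 50 features
y = rng.normal(size=10)
theta = ridge_fit(X, y, lam=0.5)
print(theta.shape)                # (50,)
```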
Regularization: What is the impact of penalty?
 LASSO regression
 LASSO stands for Least
Absolute Shrinkage and
Selection Operator
 It tends to select the features that describe the target value y well
 Some θ_m's vanish if the corresponding features don't have a strong correlation with y
 LASSO effectively reduces M, rather than increasing N.
2019-09-26 90Machine learning and artificial neural network
Regularization: What is the impact of penalty?
 Further remarks on LASSO regression
 λ controls sparsity (a higher λ selects fewer features)
 LASSO tends to select one feature from a group of highly correlated variables (features) and ignore the rest.
 Unlike the L2 penalty, the L1 penalty is not differentiable at θ_m = 0
 LASSO regression is a convex optimization problem, while it is NOT a simple QP problem
 → use an iterative algorithm to find the solution, especially when M > N (the coordinate descent algorithm, discussed next)
 See the textbook for the coordinate descent algorithm for LASSO
2019-09-26 91Machine learning and artificial neural network
Regularization: Elastic-net
 Elastic-net
 Elastic-net combines L1 and L2 penalty
 The L1 penalty selects features (generating a sparse model)
 The L2 penalty reduces the generalization error and also encourages grouping effects.
2019-09-26 92Machine learning and artificial neural network
Homework & Computer Lab.
 Homework: 6.2, 6.3
 Computer lab: ML_practice1_regression_ex_190820.ipynb
Machine Learning and Neural Network
Appendix C: Optimization III
Seokhyun Yoon, Electronics Eng., Dankook University
Appendix: Optimization
 Topics
1. Optimization I: Unconstrained optimization
 Definition of optimization problem
 Quadratic programming problem
 Maximum likelihood estimation as an optimization problem
2. Optimization II: Iterative solutions
 Gradient descent and stochastic gradient descent
 Coordinate descent
 Newton-Raphson method
3. Optimization III: Constrained optimization
 Definition
 Lagrange multiplier and Rayleigh quotient optimization
 Duality in constrained optimization and KKT condition
2019-09-26 94Machine learning and artificial neural network
Unconstrained optimization
 Definitions of unconstrained optimization
 Minimization: min_{θ∈ℝᴹ} J(θ), or θ* = argmin_θ J(θ)
 Maximization: max_{θ∈ℝᴹ} J(θ), or θ* = argmax_θ J(θ)
where J(θ) is a cost/objective function.
 Convex optimization
 If J(θ) is a convex function, the solution can be obtained by solving ∇_θ J(θ) = 0 (as there is only one minimum (maximum))
 Sometimes, however, one cannot get a closed-form solution.
 What can we do, then?
2019-09-26 95Machine learning and artificial neural network
Iterative search for minimum/maximum
 One idea: gradient search
 Gradient descent
 Hill climbing
 Steps
 Given a cost function J(θ)
 Initialize n = 0, θ⁽⁰⁾ = 0
 Loop (epoch):
1. Compute the gradient at the current position, g = ∇_θ J(θ)|_{θ=θ⁽ⁿ⁾}
2. Update the parameters, θ⁽ⁿ⁺¹⁾ = θ⁽ⁿ⁾ − η g
3. n ← n + 1
4. Repeat 1–3 until convergence
2019-09-26 96Machine learning and artificial neural network
 η: learning rate, 0 < η ≪ 1
 A small enough η ensures that J(θ⁽ⁿ⁺¹⁾) ≤ J(θ⁽ⁿ⁾)
 Large η: fast convergence, but high MSE due to bouncing
 Small η: slow convergence, but lower MSE
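A minimal sketch of the gradient-descent loop above; the quadratic example cost and the learning rate are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_J, theta0, eta=0.01, n_epochs=1000):
    """Plain gradient descent: theta <- theta - eta * grad J(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_epochs):
        theta = theta - eta * grad_J(theta)
    return theta

# Example: minimize J(theta) = (theta1 - 3)^2 + 2*(theta2 + 1)^2
grad_J = lambda th: np.array([2 * (th[0] - 3), 4 * (th[1] + 1)])
print(gradient_descent(grad_J, [0.0, 0.0], eta=0.05, n_epochs=500))   # ~ [3, -1]
```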
Iterative search for minimum/maximum
 Stochastic gradient descent (SGD)
 The cost is typically a sum of per-sample costs: J(θ) = Σ_i J_i(θ)
 Update for every sample
 Steps
 Initialize θ = 0
 Outer loop (epoch): for n = 1, 2, …
• Inner loop: for i = 1, 2, …, N (number of samples)
  θ ← θ − η ∇_θ J_i(θ)
• Repeat the inner loop until convergence
2019-09-26 97Machine learning and artificial neural network
Iterative search for minimum/maximum
 In linear regression
 J(θ) = Σ_i e_i², with e_i = y_i − x̃_iᵀθ
 ∇_θ J_i(θ) = −2 e_i x̃_i (gradient of the per-sample cost)
 SGD for linear regression
 Initialize θ = 0, n = 0
 Outer loop (epoch): for n = 1, 2, …
• Inner loop: for i = 1, 2, …, N (number of samples)
  e_i⁽ⁿ⁾ = y_i − x̃_iᵀθ⁽ⁿ⁾
  θ⁽ⁿ⁾ ← θ⁽ⁿ⁾ + η e_i⁽ⁿ⁾ x̃_i
• Repeat the inner loop until convergence
2019-09-26 98Machine learning and artificial neural network
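A minimal sketch of the per-sample SGD rule above (rows of X are 1-augmented samples here); the toy data and learning rate are assumed for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, n_epochs=50):
    """Per-sample SGD for linear regression (rows of X are 1-augmented samples)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):                 # outer loop: epochs
        for x_i, y_i in zip(X, y):            # inner loop: one update per sample
            e_i = y_i - x_i @ theta           # per-sample error
            theta = theta + eta * e_i * x_i   # gradient step on the per-sample cost
    return theta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=100)
print(sgd_linear_regression(X, y))            # close to [1, 2, -1]
```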
Iterative search for minimum/maximum
 Using momentum
 In SGD, if each sample contains “noise”, it disturbs the algorithm, i.e., the parameters may move in an incorrect direction
 This can be alleviated using momentum:
  v⁽ⁿ⁾ = γ v⁽ⁿ⁻¹⁾ − η ∇_θ J_i(θ⁽ⁿ⁾)
  θ⁽ⁿ⁺¹⁾ = θ⁽ⁿ⁾ + v⁽ⁿ⁾
 where 0 < γ < 1
2019-09-26 99Machine learning and artificial neural network
Iterative search for minimum/maximum
 Coordinate descent
 Rather than updating every parameter at once,
 update the parameters one by one (one coordinate at a time)
  θ_k⁽ⁿ⁺¹⁾ = argmin_{θ_k} J(θ₁⁽ⁿ⁺¹⁾, …, θ_{k−1}⁽ⁿ⁺¹⁾, θ_k, θ_{k+1}⁽ⁿ⁾, …, θ_M⁽ⁿ⁾)
 θ_k⁽ⁿ⁺¹⁾ is given by the solution of the equation
  ∂J(θ)/∂θ_k = 0, evaluated at [θ₁⁽ⁿ⁺¹⁾, …, θ_{k−1}⁽ⁿ⁺¹⁾, θ_k, θ_{k+1}⁽ⁿ⁾, …, θ_M⁽ⁿ⁾]
 Simpler implementation:
  θ_k⁽ⁿ⁺¹⁾ = argmin_{θ_k} J(θ₁⁽ⁿ⁾, …, θ_{k−1}⁽ⁿ⁾, θ_k, θ_{k+1}⁽ⁿ⁾, …, θ_M⁽ⁿ⁾)
2019-09-26 100Machine learning and artificial neural network
Iterative search for minimum/maximum
 Coordinate descent for linear regression
 Cost: J(θ) = ‖y − X̃ᵀθ‖²
 Setting ∂J/∂θ_k = 0 with all other θ_j (j ≠ k) fixed gives a closed-form per-coordinate update
 Update rule: θ_k⁽ⁿ⁺¹⁾ = x̃⁽ᵏ⁾ᵀ(y − Σ_{j≠k} θ_j x̃⁽ʲ⁾) / ‖x̃⁽ᵏ⁾‖², where x̃⁽ᵏ⁾ is the k-th row of X̃ (taken as a column vector)
 Homework: C.1
2019-09-26 101Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.6: Classification
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.6 Classification: problem formulation
 Topics
1. Bayesian approach
2. Bayesian approach under Gaussian assumption
 Decision boundary
3. Linear model as a special case
2019-09-26 103Machine learning and artificial neural network
Classification: Problem formulation
 Data
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ C
 where C = {c₁, …, c_K} is a set of categories (classes)
 The y_i's are categorical and discrete
 Bayesian approach: probabilistic model
 Assume each class (the k-th class) is distributed as ~ p(x|H_k).
 Given new data x, decide its class y as
  ŷ = argmax_{k ∈ {1,…,K}} p(x|H_k)
 i.e., select the class index for which the conditional probability of x is maximum
2019-09-26 104Machine learning and artificial neural network
Classification: Bayesian approach
 Binary classification
 Assume binary classification (for simplicity), i.e., y ∈ {0, 1}
 Given new data x, decide its class y by comparing the log-likelihoods log p(x|H₀) and log p(x|H₁)
 Binary classification under Gaussian assumption
 Assume p(x|H_k) = N(x; μ_k, C_k) with parameters μ_k and C_k.
 Then, we have
  log p(x|H₁) − log p(x|H₀) = ½[(x − μ₀)ᵀC₀⁻¹(x − μ₀) − (x − μ₁)ᵀC₁⁻¹(x − μ₁)] + ½ log(|C₀|/|C₁|) (decide class 1 if this is > 0)
2019-09-26 105Machine learning and artificial neural network
Classification: Bayesian approach
 Binary classification under Gaussian assumption
 Suppose that C₀ = C₁ = C. Then, we have
  compare (x − μ₁)ᵀC⁻¹(x − μ₁) with (x − μ₀)ᵀC⁻¹(x − μ₀) (decide class 1 if the former is smaller)
 i.e., compare the (Mahalanobis) distances of x from the class centers
2019-09-26 106Machine learning and artificial neural network
𝑝 𝒙|𝐻 𝑝 𝒙|𝐻
Classification: Decision boundary
 Decision boundary
 It is a “surface” where p(x|H₀) = p(x|H₁), i.e.,
  (x − μ₀)ᵀC₀⁻¹(x − μ₀) − (x − μ₁)ᵀC₁⁻¹(x − μ₁) + log(|C₀|/|C₁|) = 0
 It can be written as
  xᵀ(C₀⁻¹ − C₁⁻¹)x + bᵀx + c = 0
 where
  b = 2(C₁⁻¹μ₁ − C₀⁻¹μ₀) (a vector)
  c = μ₀ᵀC₀⁻¹μ₀ − μ₁ᵀC₁⁻¹μ₁ + log(|C₀|/|C₁|) (a scalar)
 The decision boundary is given by a “conic section”,
 which can be a hyperbola, an ellipse or a (hyper)plane
2019-09-26 107Machine learning and artificial neural network
Classification: Linear model
 Linear model for binary classification
 Suppose further that C₀ = C₁ = C.
 Then, the decision boundary becomes
  θᵀx + θ₀ = 0, with θ = 2C⁻¹(μ₁ − μ₀) and θ₀ = μ₀ᵀC⁻¹μ₀ − μ₁ᵀC⁻¹μ₁
 which is a (hyper)plane
 And the decision rule becomes
  decide class 1 if θᵀx + θ₀ > 0, or equivalently, ŷ = u(θᵀx + θ₀) (u: unit step)
 Model parameters: θ and θ₀ (intercept)
 A linear classifier partitions ( ? ) into
non-overlapping areas using ( ? )
2019-09-26 108Machine learning and artificial neural network
Classification: Linear model vs. Bayesian approach
 Bayesian classifier versus linear classifier
2019-09-26 109Machine learning and artificial neural network
[Figure: class-conditional densities p(x|H₀) and p(x|H₁) with the Bayesian decision boundary vs. the linear boundary θᵀx + θ₀ = 0]
Classification: Summary
 Binary classification: summary
 Bayesian approach: ŷ = argmax_k p(x|H_k)
 Under Gaussian assumption (with C₀ = C₁ = C):
  compare (x − μ₀)ᵀC⁻¹(x − μ₀) and (x − μ₁)ᵀC⁻¹(x − μ₁)
 With this, we get the linear model
  decide class 1 if θᵀx + θ₀ > 0, or equivalently, ŷ = u(θᵀx + θ₀)
2019-09-26 110Machine learning and artificial neural network
Our main focus is
on this linear model
Classification: Naive implementation
 Naive implementation
 Given data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ C
 C is a set of categories (classes)
 The y_i's are categorical and discrete, e.g., y_i ∈ {0, 1}
 Divide the data into X₀ and X₁ (one subset for each class)
 Compute (μ_k, C_k) for k = 0, 1
 Use the p(x|H_k)'s for classification
 This is not our focus, though.
2019-09-26 111Machine learning and artificial neural network
Classification: Roadmap
 Based on the model ŷ = u(θᵀx + θ₀),
 Ch.7: We will develop a training (learning) rule, where we obtain θ and θ₀ directly from data by solving an optimization problem
 Ch.8: The linear model will be extended to the multinomial classification problem
 Ch.9: The model will be further extended to obtain the neural network model
2019-09-26 112Machine learning and artificial neural network
Homework & Computer Lab.
 Homework: 5.1, 5.2, 5.3
2019-09-26 113Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.7: Logistic Regression
(binary classification)
Seokhyun Yoon, Electronics Eng., Dankook University
Roadmap
2019-09-26 115Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
Ch.7 Logistic Regression for binary classification
 Topics
1. Logistic regression:
 Model with logistic sigmoid function
2. Parameter optimization:
 Likelihood function as an objective function
 Application of gradient search algorithm
3. Performance measures of binary classifier
 Confusion matrix, True Positive and False negative
 Accuracy, Sensitivity, Specificity
 ROC and AUC
2019-09-26 116Machine learning and artificial neural network
Logistic regression: Model
 Recall the (generalized) linear model for binary classification, ŷ = f(θᵀx + θ₀)
 It is a linear regressor if f(z) = z
 It is a linear classifier if f(z) = u(z) (unit step)
 It is a logistic regressor if f(z) = σ(z) = 1/(1 + e^(−z))
2019-09-26 117Machine learning and artificial neural network
Logistic regression: Model
 Interpretation of logistic regression model
 ŷ = σ(θᵀx + θ₀), where σ(z) = 1/(1 + e^(−z))
 ŷ can be regarded as Pr{y = 1 | x}, so that
  Pr{y = 1 | x} = 1/(1 + e^(−(θᵀx+θ₀))) = e^(θᵀx+θ₀)/(1 + e^(θᵀx+θ₀))
  Pr{y = 0 | x} = 1 − ŷ = 1/(1 + e^(θᵀx+θ₀))
 ŷ can also be interpreted as a “class estimate”.
 In both cases, if θᵀx + θ₀ > 0, x is likely to be class 1; otherwise class 0.
 e^(θᵀx+θ₀) is called the “odds” of being class 1. (Note: e^(θᵀx+θ₀) = Pr{y=1|x}/Pr{y=0|x}.)
2019-09-26 118Machine learning and artificial neural network
Logistic regression
 Geometrical
interpretation
2019-09-26 119Machine learning and artificial neural network
[Figure: decision boundary θᵀx + θ₀ = 0 separating class 1 from class 0; the decision variable z = θᵀx + θ₀ determines the odds of x belonging to class 1]
Logistic regression: Cost function
 Cost function: Negative log-likelihood
 ŷ_i = σ(θᵀx_i + θ₀) can be interpreted as the probability (likelihood) that x_i belongs to class 1.
 The likelihood that x_i belongs to its target class y_i is given by p_i = ŷ_i^{y_i}(1 − ŷ_i)^{1−y_i}
 Log-likelihood as an “objective” to maximize: L(θ) = Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]
 Can also be formulated as the minimization of −L(θ)
2019-09-26 120Machine learning and artificial neural network
Logistic regression
 Elements of regression/classification problem
 Data (observations x and their target values y)
 Prediction model f(x; θ) with parameters θ
 Cost (loss)/objective function to minimize/maximize
 Algorithm to efficiently obtain the optimal or a good solution
2019-09-26 121Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = σ(θᵀx̃), σ(z) = 1/(1 + e^(−z))
Cost/loss: negative log-likelihood −L(θ)
Algorithm to min/maximize: gradient descent
Logistic regression: Optimization
 Optimization
 L(θ) contains the non-linear function σ(z) = 1/(1 + e^(−z)).
 → argmax_θ L(θ) isn't a simple QP problem.
 We resort to gradient search to get the optimal (or a good) solution.
 To perform gradient search, we need the gradient of the cost, which is given by (see textbook p.68)
  ∇_θ L(θ) = Σ_i (y_i − ŷ_i) x̃_i
 Algorithm (pseudo code)
 Initialize θ⁽⁰⁾
 θ⁽ⁿ⁺¹⁾ = θ⁽ⁿ⁾ + η ∇_θ L(θ⁽ⁿ⁾) for n = 0, 1, 2, …
2019-09-26 122Machine learning and artificial neural network
“+” means hill-climbing
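A minimal NumPy sketch of the hill-climbing rule above (using the full-batch gradient of the log-likelihood rather than per-sample updates, for brevity); the toy data and learning rate are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_fit(X, y, eta=0.01, n_epochs=500):
    """Gradient ascent on the log-likelihood; rows of X are 1-augmented samples, y in {0,1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        p = sigmoid(X @ theta)                 # predicted class-1 probabilities
        theta = theta + eta * X.T @ (y - p)    # "+": hill-climbing on the likelihood
    return theta

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X @ np.array([-0.5, 2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
theta = logistic_regression_fit(X, y)
print(np.mean((sigmoid(X @ theta) > 0.5) == y))   # training accuracy
```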
Logistic regression: Another cost function
 Another cost: residual sum of squares (RSS)
 ŷ_i = σ(θᵀx_i + θ₀) can also be interpreted as a class estimate.
 Define the estimation error: e_i = y_i − ŷ_i
 RSS as a cost to minimize: J(θ) = Σ_i e_i²
 Gradient (see textbook p.68)
  ∇_θ J(θ) = −2 Σ_i e_i ŷ_i(1 − ŷ_i) x̃_i
 Gradient descent
  θ⁽ⁿ⁺¹⁾ = θ⁽ⁿ⁾ − η ∇_θ J(θ⁽ⁿ⁾) for n = 0, 1, 2, …
 What's the difference from the likelihood-based optimization?
2019-09-26 123Machine learning and artificial neural network
“-” means gradient descent
Performance measures of binary classifier
 Confusion matrix
 The confusion matrix counts TP (true positives), FN (false negatives), FP (false positives) and TN (true negatives)
 Accuracy = (TP + TN)/(TP + TN + FP + FN)
 Sensitivity (TPR) = TP/(TP + FN)
 Specificity = TN/(TN + FP) = 1 − FPR, where FPR = FP/(FP + TN)
2019-09-26 124Machine learning and artificial neural network
 Why do we need other
measures than accuracy?
 In some application, FN (FP)
causes more serious problem
than FP (FN)
 E.g., in medical application, you
want to make decision if a
person has tumor (P) or not (N).
It isn’t a big problem if a normal
person (without tumor) is
decided to have tumor (FP). But,
the opposite case (a person with
tumor decided as normal, FN)
may cause serious problem.
 You may want to minimize FPR
requiring TPR no less than a
certain threshold.
Performance measures of binary classifier
 ROC and AUC
 ROC: Receiver operating characteristic
 AUC: Area under (the ROC) curve
2019-09-26 125Machine learning and artificial neural network
[Figure: ROC curve — TPR = TP/(TP+FN) plotted against FPR = FP/(FP+TN), both ranging from 0 to 1; the AUC is the area under the curve; the curve traces the performance as the decision boundary (threshold) is moved: shifting it toward the positive side makes TP and FP go down (TN and FN go up), shifting it the other way makes TP and FP go up (TN and FN go down)]
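A minimal sketch computing accuracy, sensitivity and specificity from the confusion-matrix counts; the label vectors are assumed for illustration. Sweeping the decision threshold of a classifier and recomputing (FPR, TPR) at each threshold traces out the ROC curve.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TPR) and specificity from the 2x2 confusion-matrix counts."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # TPR = TP / (TP + FN)
    specificity = tn / (tn + fp)      # 1 - FPR, FPR = FP / (FP + TN)
    return accuracy, sensitivity, specificity

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])
print(binary_metrics(y_true, y_pred))
```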
Homework & Computer Lab.
 Homework: 7.1, 7.2
 Practice: ML_practice2_classification_ex_190820.ipynb
2019-09-26 126Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.8: Multi-task regression
and multinomial classification
Seokhyun Yoon, Electronics Eng., Dankook University
Roadmap
2019-09-26 128Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
Ch.8 Multiclass classification
 Topics
1. Multi-task regression
2. Multinomial classification
3. Generalized linear model
2019-09-26 129Machine learning and artificial neural network
Multi-task linear regression
 Linear regression with a vector target
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ ℝᴷ
 where Y: K×N matrix with each column being y_i
 Linear model: ŷ = Θ x̃
 where Θ: K×(M+1) matrix (including the intercepts)
 Define
  θ⁽ᵏ⁾: the k-th row of Θ (θ_k: the k-th column of Θ)
  y⁽ᵏ⁾: the k-th row of Y
 Cost function (RSS)
  J(Θ) = Σ_i ‖y_i − Θ x̃_i‖² = Σ_k J_k(θ⁽ᵏ⁾), where J_k(θ⁽ᵏ⁾) = Σ_i (y_i⁽ᵏ⁾ − θ⁽ᵏ⁾ᵀx̃_i)²
2019-09-26 130Machine learning and artificial neural network
Multi-task linear regression
 Linear regression with a vector target
 The cost function is a sum of the RSS for each target value (task): J(Θ) = Σ_k J_k(θ⁽ᵏ⁾)
 Optimization can be performed separately for each target value, i.e.,
  min_Θ J(Θ) ⇔ min_{θ⁽ᵏ⁾} J_k(θ⁽ᵏ⁾) for each k
 where ∇_{θ⁽ᵏ⁾} J_k = 0 gives θ⁽ᵏ⁾ = (X̃ X̃ᵀ)⁻¹ X̃ y⁽ᵏ⁾
 And ∇_Θ J = 0 gives Θᵀ = (X̃ X̃ᵀ)⁻¹ X̃ Yᵀ
 Can be implemented using K parallel linear regressors with a scalar target value
2019-09-26 131Machine learning and artificial neural network
Multi-task linear regression
 Linear regression with a vector target
 Can be implemented using K parallel linear regressors with a scalar target value
 Alternative expression of the cost function: J(Θ) = ‖Y − ΘX̃‖_F² = tr[(Y − ΘX̃)(Y − ΘX̃)ᵀ]
2019-09-26 132Machine learning and artificial neural network
Multinomial classification: two approaches
 Multinomial classification can be implemented using multiple binary classifiers.
 Two approaches (K-class case)
 One against the rest:
 We use K binary classifiers, one for each class.
 Each classifier (the k-th classifier) computes, for example, the likelihood p̂_k = Pr{y = k | x} of input x belonging to the k-th class.
 Decide the class having the highest likelihood
 Pairwise binary classification + majority voting:
 We use K(K−1)/2 binary classifiers, one for each pair of classes.
 Decide the class by taking a majority vote over the winners.
2019-09-26 133Machine learning and artificial neural network
Multinomial logistic regression
 Data
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ C
 where C = {0, 1, …, K−1} is a set of categories (classes)
 The y_i's are categorical and discrete
 Considerations
 A (single-task) logistic regressor using the (integer) y as its target value will not work well (because the y's are categorical, while a single-task regressor regards the y's as numerical.)
 One approach is to encode the y's into binary vectors (of size K×1) and use a multi-task logistic regressor
2019-09-26 134Machine learning and artificial neural network
Multinomial logistic regression
 Model
 Softmax function on top of a multi-task linear regressor
 Multi-task linear regressor
  o_k = θ⁽ᵏ⁾ᵀ x̃ for k = 1, …, K
  (odds of x belonging to class k)
  Or, collectively, o = Θ x̃
 Softmax function
  p̂_k = S_k(o) = e^{o_k} / Σ_j e^{o_j}
  (likelihood of x belonging to class k)
 Note that 0 ≤ p̂_k ≤ 1 and Σ_k p̂_k = 1
2019-09-26 135Machine learning and artificial neural network
Multinomial logistic regression
 Cost/objective
 p̂_k can be interpreted as Pr{x belongs to class k}
 The log-likelihood L(Θ) = Σ_i log p̂_{y_i}(x_i) can be used as the objective to maximize.
 Gradient:
  ∇_{θ⁽ᵏ⁾} L = Σ_i (t_{k,i} − p̂_k(x_i)) x̃_i, where t_{k,i} = 1 if y_i = k and 0 otherwise
 Gradient search:
  Θ⁽ⁿ⁺¹⁾ = Θ⁽ⁿ⁾ + η ∇_Θ L(Θ⁽ⁿ⁾) for n = 0, 1, 2, …
2019-09-26 136Machine learning and artificial neural network
Since 0 ≤ 𝑆 (𝜣 𝒙 ) ≤ 1,
the direction of gradient is
either 𝒙 for 𝑘 = 𝑦 or −𝒙 for 𝑘 ≠ 𝑦
Multinomial logistic regression: more issues
 One hot encoding
 One hot encoding is a mapping of an integer y ∈ {0, 1, …, K−1} to a binary vector t = [t₀, t₁, …, t_{K−1}]ᵀ such that t_k = 1 if k = y and t_k = 0 otherwise, i.e., only one element of t is 1 and all others are 0.
 Example: with K = 4, y = 2 is encoded as t = [0, 0, 1, 0]ᵀ
 By encoding all the target values y₁, y₂, …, y_N into t₁, t₂, …, t_N, we have T = [t₁, t₂, …, t_N]
 T is a K×N matrix with each column being t_i
 Then, the gradient is given by
  ∇_Θ L = Σ_i (t_i − p̂_i) x̃_iᵀ
2019-09-26 137Machine learning and artificial neural network
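A minimal NumPy sketch of multinomial logistic regression with one-hot targets and the gradient Σ_i (t_i − p̂_i) x̃_iᵀ above; the toy data, K = 3, and the learning rate are assumptions for illustration.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max(axis=0, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def one_hot(y, K):
    T = np.zeros((K, len(y)))
    T[y, np.arange(len(y))] = 1.0
    return T

def multinomial_fit(X, y, K, eta=0.01, n_epochs=300):
    """Gradient ascent on the log-likelihood of a softmax (multinomial) regressor.
    X is (M+1) x N with 1-augmented samples as columns; y holds integer labels 0..K-1."""
    Theta = np.zeros((K, X.shape[0]))
    T = one_hot(y, K)
    for _ in range(n_epochs):
        P = softmax(Theta @ X)                     # K x N class likelihoods
        Theta = Theta + eta * (T - P) @ X.T        # gradient: sum_i (t_i - p_i) x_i^T
    return Theta

rng = np.random.default_rng(5)
X = np.vstack([np.ones(150), rng.normal(size=(2, 150))])
W_true = rng.normal(size=(3, 3))
y = np.argmax(W_true @ X, axis=0)                  # assumed labels from a linear rule
Theta = multinomial_fit(X, y, K=3)
print(softmax(Theta @ X).argmax(axis=0)[:10])      # predicted classes
```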
Multinomial logistic regression : more issues
 Cross-entropy
 With one hot encoding: t₁, t₂, …, t_N, where t_i = [t_{0,i}, …, t_{K−1,i}]ᵀ
 t_i is the probability mass of y_i
 Posterior likelihood of x_i: p̂_i = [p̂_{0,i}, …, p̂_{K−1,i}]ᵀ with p̂_{k,i} = S_k(Θ x̃_i)
 The cross entropy between t_i and p̂_i is given by
  H(t_i, p̂_i) = −Σ_k t_{k,i} log p̂_{k,i}
 We call J(Θ) = Σ_i H(t_i, p̂_i) the “cross-entropy cost”.
2019-09-26 138Machine learning and artificial neural network
Multinomial logistic regression : more issues
 Multi-task logistic regressor
 Using one hot encoding, one can
replace (for simplicity) the softmax
function with K separate logistic
sigmoid function
 K parallel logistic regressors.
 Performance ?
2019-09-26 139Machine learning and artificial neural network
[Figure: K parallel logistic regressors — inputs x0 … xM, linear outputs o1 … oK, separate sigmoid activations s(o_k) producing p̂1 … p̂K]
 Other remarks
 Multinomial logistic regression is a one-against-the-rest approach.
 Once the likelihoods p̂_k are obtained, the class estimate is determined by ŷ = argmax_k p̂_k
Multinomial logistic regression: generalization
 Generalized linear model
 Linear regression and logistic
regressions can be represented by
one structure
 Consisting of an “activation
function” on top of multi-
task linear regressor
 The output can be interpreted
in various ways (e.g., as likelihoods
or as estimates of target value)
2019-09-26 140Machine learning and artificial neural network
 Also, there are many options for activation function (e.g.,
linear, sigmoid or tanh)
 If input is categorical, apply one hot encoding before
feed to regressor (input dimension must be changed too)
Multinomial logistic regression: generalization
 Generalized linear model
 Regularization can also be applied, if desired, by defining the cost with a penalty
  J(Θ) = J₀(Θ) + λ‖Θ‖_F²
 where J₀ is
  for linear regression: the RSS
  for logistic regression: the negative log-likelihood (cross-entropy)
 The model basically regards the input and output as numerical. So, if you deal with categorical values, you need to apply one hot encoding first.
2019-09-26 141Machine learning and artificial neural network
Homework & Computer Lab.
 Practice: ML_practice2_classification_ex_190820.ipynb
2019-09-26 142Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.9: Artificial neural network
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.9 Artificial neural network
 Topics
1. Perceptron and artificial neural network (NN)
2. Neural network model
3. Training NN: backpropagation
4. Some issues on NN
 Convergence to local minima
 Overfitting
 Vanishing gradient problem
5. Practical considerations (building and training NN)
2019-09-26 144Machine learning and artificial neural network
Roadmap
2019-09-26 145Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
ANN: Perceptron
 Perceptron
 It is an array of interconnected neurons, exactly the same as in generalized linear models
 It was originally proposed to mimic the biological neuron
2019-09-26 146Machine learning and artificial neural network
[Figure: a biological neuron (source: https://en.wikipedia.org/wiki/Biological_neuron_model) compared with the regression model (artificial neuron) — input nodes (dendrites) x0 … xM, (synaptic) weights θ0 … θM, net = θᵀx, activation function f(net), output node (axon terminal) y]
ANN: Perceptron
 Perceptron
 The multi-task regression model is a horizontal array of artificial neurons, with either combined activation or separate activation
2019-09-26 147Machine learning and artificial neural network
[Figure: a layer of K artificial neurons with inputs x0 … xM and linear outputs o1 … oK — left: combined activation ŷ = f(o1, o2, …, oK); right: separate activations p̂_k = s(o_k)]
ANN: Multi-layer Perceptron
 Multi-layer Perceptron
 Consists of multiple layers of
multi-task regressors vertically
stacked
 Output of one layer is fed to the
input of the next layer.
 Number of layers and number of
neurons per layer can be
arbitrarily set
 The non-linear activation function makes it different from the single-layer (linear) model, i.e., it makes the model non-linear
 Can be used for regression and
classification
2019-09-26 148Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Operations
 Feedforward (prediction phase): For a given input x and the current parameters W, it produces an output ŷ
 Feedback (training phase): For each input x and target vector y, the parameters W⁽ˡ⁾ are updated
 Gradient search is used for some optimality criterion
2019-09-26 149Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Structure definition
 Number of layers: L
 Number of neurons per layer: K₁, K₂, …, K_L
 Full connection assumed
 Signals and parameters
 Input: x = z⁽⁰⁾
 Target vector: y
 Weight matrices: W⁽ˡ⁾, l = 1, …, L
 Hidden layer outputs: z⁽ˡ⁾, l = 1, …, L−1
 Final output: ŷ = z⁽ᴸ⁾
2019-09-26 150Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Feedforward (prediction)
From l = 1 to L:
1) a⁽ˡ⁾ = W⁽ˡ⁾ z̃⁽ˡ⁻¹⁾
2) z⁽ˡ⁾ = f(a⁽ˡ⁾)
 More simply, z⁽ˡ⁾ = f(W⁽ˡ⁾ z̃⁽ˡ⁻¹⁾)
 z̃⁽ˡ⁻¹⁾ is the 1-augmented version of z⁽ˡ⁻¹⁾
 W⁽ˡ⁾ is a K_l × (K_{l−1} + 1) matrix including the “intercept”
 The activation function is applied to each element of a⁽ˡ⁾
2019-09-26 151Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Feedback (training)
 Assume training is performed on a per-sample basis, i.e., SGD
 Cost function (RSS): J(W⁽¹⁾, W⁽²⁾, …, W⁽ᴸ⁾) = ‖y − z⁽ᴸ⁾‖²
 Cross-entropy can also be used as the cost (not covered here)
 To train the model, we need ∇_{W⁽ˡ⁾} J for l = 1, …, L
 The top layer is easy: ∇_{W⁽ᴸ⁾} J = −2 δ⁽ᴸ⁾ z̃⁽ᴸ⁻¹⁾ᵀ,
 where δ⁽ᴸ⁾ = f′(a⁽ᴸ⁾) ⊙ e⁽ᴸ⁾ and e⁽ᴸ⁾ = y − z⁽ᴸ⁾
 The layers below? We need to apply the chain rule
 The problem, however, is not as simple as you might expect.
See textbook, section 9.3
2019-09-26 152Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Feedback (training)
 The training starts from the top layer and runs downward, one layer at a time.
 Training: from l = L down to 1:
  W⁽ˡ⁾ ← W⁽ˡ⁾ + η ΔW⁽ˡ⁾, with ΔW⁽ˡ⁾ ∝ −∇_{W⁽ˡ⁾} J(W⁽¹⁾, W⁽²⁾, …, W⁽ᴸ⁾)
 where, by applying the chain rule (see textbook p.81-82),
  ΔW⁽ˡ⁾ = δ⁽ˡ⁾ z̃⁽ˡ⁻¹⁾ᵀ, with δ⁽ˡ⁾ = f′(a⁽ˡ⁾) ⊙ (W⁽ˡ⁺¹⁾ᵀ δ⁽ˡ⁺¹⁾)
 We call it “backpropagation (BP)” as it is performed backward (downward), opposite to the feedforward operation.
2019-09-26 153Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Back-propagation (BP) algorithm
 From l = L down to 1: W⁽ˡ⁾ ← W⁽ˡ⁾ + η ΔW⁽ˡ⁾, where (element-wise)
  Δw_{kj}⁽ˡ⁾ = δ_k⁽ˡ⁾ z_j⁽ˡ⁻¹⁾
  δ_k⁽ᴸ⁾ = f′(a_k⁽ᴸ⁾)(y_k − z_k⁽ᴸ⁾)
  δ_k⁽ˡ⁾ = f′(a_k⁽ˡ⁾) Σ_j w_{jk}⁽ˡ⁺¹⁾ δ_j⁽ˡ⁺¹⁾
2019-09-26 154Machine learning and artificial neural network
Vector-matrix form
 ΔW⁽ˡ⁾ = δ⁽ˡ⁾ z̃⁽ˡ⁻¹⁾ᵀ
 δ⁽ᴸ⁾ = f′(a⁽ᴸ⁾) ⊙ (y − z⁽ᴸ⁾)
 δ⁽ˡ⁾ = f′(a⁽ˡ⁾) ⊙ (W⁽ˡ⁺¹⁾ᵀ δ⁽ˡ⁺¹⁾)
 ⊙: element-wise product
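A minimal per-sample feedforward + BP sketch of the rules above, with logistic-sigmoid activations and the RSS cost; the tiny 2-3-1 network, the learning rate, and the toy sample are assumptions. (When back-propagating through W⁽ˡ⁺¹⁾, the bias column is dropped, since δ⁽ˡ⁾ corresponds to the non-bias units.)

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_train_step(W, x, y, eta=0.1):
    """One per-sample feedforward + backpropagation step for a fully connected MLP
    with logistic-sigmoid activations and the RSS cost (a sketch of the BP rule above)."""
    # Feedforward: keep the 1-augmented layer outputs z~ and the activations a
    zs, a_list = [np.append(1.0, x)], []
    for Wl in W:
        a = Wl @ zs[-1]
        a_list.append(a)
        zs.append(np.append(1.0, sigmoid(a)))
    z_out = zs[-1][1:]                                         # network output z^(L)

    # Backpropagation: delta^(L) = f'(a^(L)) * (y - z^(L)), then push deltas downward
    deltas = [None] * len(W)
    deltas[-1] = sigmoid(a_list[-1]) * (1 - sigmoid(a_list[-1])) * (y - z_out)
    for l in range(len(W) - 2, -1, -1):
        fp = sigmoid(a_list[l]) * (1 - sigmoid(a_list[l]))
        deltas[l] = fp * (W[l + 1][:, 1:].T @ deltas[l + 1])   # drop the bias column

    # Parameter update: W^(l) <- W^(l) + eta * delta^(l) z~^(l-1)^T
    for l in range(len(W)):
        W[l] += eta * np.outer(deltas[l], zs[l])
    return z_out

# Tiny 2-3-1 network trained on one assumed toy sample
rng = np.random.default_rng(6)
W = [rng.normal(scale=0.5, size=(3, 3)), rng.normal(scale=0.5, size=(1, 4))]
for _ in range(1000):
    mlp_train_step(W, np.array([0.5, -0.2]), np.array([1.0]))
print(mlp_train_step(W, np.array([0.5, -0.2]), np.array([1.0])))   # output approaches 1
```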
ANN: Multi-layer Perceptron
 Activation function
 Except for the top (output) layer, the activation function should be non-linear for a hidden layer to be effective.
 Any monotonically increasing function can be used.
 They are typically S-shaped, e.g., the logistic sigmoid or tanh
 ReLU and leaky ReLU have been widely used recently.
  ReLU: f(a) = max(0, a)
  Leaky ReLU: f(a) = max(αa, a) with 0 < α < 1
2019-09-26 155Machine learning and artificial neural network
ANN Issues: Convergence to local minima
 Convergence to local minima
 NN is a non-linear model and the cost J is not convex.
 The number of minima/maxima is not known
 Gradient search does not guarantee the convergence to
the global minimum
 The local minimum we get depends on the initial setting of W
 There is no systematic approach yet to achieve the global minimum
 Simulated annealing and genetic algorithms have been proposed as heuristic solutions
2019-09-26 156Machine learning and artificial neural network
ANN Issues: Overfitting
 Overfitting
 NN model has so many parameters (W(1),W(2),…,W(L))
 Deep NN is especially the case
 Similar to the linear model with N ≪ M, an NN with too many parameters may easily overfit the training data
 Three approaches to relieve overfitting
 Noise injection: increase the number of data by adding noise → reduces the generalization error (to some extent)
 Regularization: add an L1/L2 penalty to the cost function → similar impact to noise injection
 Dropout ?
2019-09-26 157Machine learning and artificial neural network
ANN Issues: Overfitting
 Dropout: avoiding co-adaptation of neurons
 Useful for Convolutional NN (for image)
 At each training phase (for a batch of samples), we
randomly select a portion of neurons (with probability p)
and disable them
 Can avoid many neurons co-adapted to each other (avoid
many neurons activated to similar data)
 Many NN packages support dropout layer as an option
2019-09-26 158Machine learning and artificial neural network
ANN Issues: Vanishing gradient
 Vanishing gradient problem
 This is also a typical problem in deep neural networks.
 BP (training) starts from the top layer and runs downward one-by-one, recursively.
 Recall: ΔW⁽ˡ⁾ = δ⁽ˡ⁾ z̃⁽ˡ⁻¹⁾ᵀ, where δ⁽ˡ⁾ = f′(a⁽ˡ⁾) ⊙ (W⁽ˡ⁺¹⁾ᵀ δ⁽ˡ⁺¹⁾)
 With the sigmoid function, 0 < f′(a) ≤ 1/4 (it is mostly close to 0)
 The δ⁽ˡ⁾'s are computed recursively
 As BP runs downward, δ⁽ˡ⁾ gets smaller and smaller, and so does ΔW⁽ˡ⁾ → vanishing gradient
 If the NN has many layers, the effective learning rate in the bottom layers gets very small, i.e., neurons in the bottom layers are hardly trained → it takes too much time to train them
2019-09-26 159Machine learning and artificial neural network
ANN Issues: Vanishing gradient
 Vanishing gradient problem
 Using ReLU or leaky ReLU may help alleviate vanishing
gradient problem.
 Unsupervised learning based pre-training of bottom layers
was proposed, though not so widely used recently.
2019-09-26 160Machine learning and artificial neural network
ANN Issues: Building NN model
 To build a neural network model, you need to
consider first
 Input and output dimension?
 How many layers? ( )
 How many neurons for each layer? ( )
 Activation function ? (sigmoid, tanh, ReLU or leaky ReLU)
 Dropout layer? With what probability? (p)
 What cost function ? (RSS or cross-entropy)
 Which optimizer to use? (simple SGD w/wo momentum .. )
 Batch size?
 Regression or classification ? (For regression, top layer
activation is typically set linear)
2019-09-26 161Machine learning and artificial neural network
ANN Issues: Training NN model
 When training NN, you need to check
 Overfitting (compare performance with training and test
data while training the model)
 Vanishing gradient (check if training takes too much time)
 Convergence to bad local minima (you can train many times
or train multiple instances in parallel with different initial
values)
2019-09-26 162Machine learning and artificial neural network
Computer Lab.
 Practice: ML_practice3_NN_ex.ipynb
Machine Learning and Neural Network
Ch.10: Recurrent neural network (RNN)
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.10 Recurrent neural network
 Topics
1. Model structure and operation.
2. RNN Training: backpropagation through time (BPTT)
3. LSTM (long/short term memory)
2019-09-26 164Machine learning and artificial neural network
Roadmap
2019-09-26 165Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
RNN: Recurrent neural network
 Features
 Recurrence means the output is fed
back to the input
 Necessarily, the input is
time-series data
 The example on the right consists of
two layers
 The hidden layer output is fed
back to the input with a one-sample
delay (D)
2019-09-26 166Machine learning and artificial neural network
 Layer 2 has no feedback loop (a conventional NN layer)
 Main applications are speech recognition and language
modelling (machine translation, sentence completion),
where data is given as a time series
h(t) = f( U x(t) + V h(t−1) )   (Layer 1)
ŷ(t) = f( W h(t) )              (Layer 2)
RNN: Recurrent neural network
 Model
 Consider a 1-layer RNN for simplicity
 Input: x(t) (time series)
 Output (state): h(t) (time series)
 Feedforward operation: h(t) = f( U x(t) + V h(t−1) )
 The output depends on both x(t) and the previous output (state) h(t−1)
 The feedforward operation can also be expressed as
h(t) = f( g(t) ), with g(t) = U x(t) + V h(t−1)
 Initial condition: assume h(0) = 0
 (A minimal NumPy sketch of this forward pass follows this slide.)
2019-09-26 167Machine learning and artificial neural network
[Figure (a): RNN with a loop — g(t) = U x(t) + V h(t−1) passes through f(·) to give h(t), which is fed back through a delay (D) as h(t−1)]
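A minimal NumPy sketch of the forward recursion above; tanh is used as the activation f, and the toy sizes are illustrative assumptions.

import numpy as np

def rnn_forward(x_seq, U, V):
    """h(t) = f( U x(t) + V h(t-1) ) with h(0) = 0 and f = tanh (an assumption)."""
    h = np.zeros(U.shape[0])              # initial condition h(0) = 0
    states = []
    for x_t in x_seq:                     # x_seq: (T, input_dim)
        g_t = U @ x_t + V @ h             # g(t) = U x(t) + V h(t-1)
        h = np.tanh(g_t)                  # h(t) = f(g(t))
        states.append(h)
    return np.stack(states)               # (T, hidden_dim)

rng = np.random.default_rng(0)
U, V = 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(4, 4))
print(rnn_forward(rng.normal(size=(5, 3)), U, V).shape)   # (5, 4)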
RNN: Recurrent neural network
 Unfolded model
2019-09-26 168Machine learning and artificial neural network
[Figure: (a) the RNN with a loop and its unfolded form over time steps, with the shared parameters U and V repeated at every step]
RNN: Training
 RNN Training (textbook 10.2)
 Cost function: E = Σ_t E(t), with E(t) = ‖y(t) − h(t)‖², where y(t) is the target vector
 The gradient can be obtained by applying the chain rule.
 Gradient w.r.t. V (at time t): since h(t) = f(g(t)) with g(t) = U x(t) + V h(t−1),
and h(t−1) itself depends on V, the chain rule must run backward through time:
∂E(t)/∂V = Σ_{k=1..t} (∂E(t)/∂g(t)) (∂g(t)/∂g(t−1)) ⋯ (∂g(k+1)/∂g(k)) (∂g(k)/∂V)
 With δ(t) = ∂E(t)/∂g(t) = −2 f′(g(t)) ⊙ (y(t) − h(t)) and ∂g(j)/∂g(j−1) = V diag(f′(g(j−1))),
each term reduces to an outer product of a back-propagated error vector and h(k−1).
2019-09-26 169Machine learning and artificial neural network
RNN: Training
 RNN Training (textbook 10.2)
 Gradient w.r.t. U (at time t): the same chain rule applies, now ending with
∂g(k)/∂U, which involves x(k) instead of h(k−1):
∂E(t)/∂U = Σ_{k=1..t} (∂E(t)/∂g(t)) (∂g(t)/∂g(t−1)) ⋯ (∂g(k+1)/∂g(k)) (∂g(k)/∂U)
 In the same way as for the gradient w.r.t. V, each term reduces to an outer
product of a back-propagated error vector and x(k).
 To update U and V, we need to perform BP through time (from t down to 1).
 We call it backpropagation through time (BPTT).
2019-09-26 170Machine learning and artificial neural network
RNN: Training
 Vanishing and exploding gradient
 Looking at ∂E(t)/∂U (and also ∂E(t)/∂V), the gradient contains products of Jacobians
∂g(t)/∂g(t−1) ⋯ ∂g(k+1)/∂g(k) = ∏_{j=k+1..t} V diag(f′(g(j−1)))
 For any activation function we considered, ‖diag(f′(g))‖ ≤ 1 (matrix norm)
 We therefore have ‖∏_j V diag(f′(g(j−1)))‖ ≤ ‖V‖^(t−k) (mostly < 1, why?)
 As t − k grows, the l.h.s. goes to 0 if ‖V‖ < 1 (vanishing gradient), or
to ∞ if the product of ‖V‖ and the activation slopes exceeds 1 (exploding gradient)
 The latter seldom occurs.
2019-09-26 171Machine learning and artificial neural network
RNN: Training
 Forgets past inputs/outputs quickly
 The same product of Jacobians appears in ∂h(t)/∂x(k), the sensitivity of the
current state to a past input x(k).
 An RNN is supposed to memorize past inputs (in the system
state) to deal with time-series data.
 With ‖V‖ < 1, however, ∂h(t)/∂x(k) → 0 as t − k gets large.
 This means the system forgets past inputs quickly.
 There are many examples where we need long-term memory
to correctly catch what a sentence means.
2019-09-26 172Machine learning and artificial neural network
RNN: Training
 RNN summary
 Due to its recurrent nature, RNN training requires
backpropagation through time (back to t = 1)
 If T gets large, the gradient may vanish or explode  the
training rule should be carefully tuned
 In most cases, vanishing gradient occurs more
frequently than exploding gradient
 One way to avoid the vanishing/exploding gradient problem
is to perform BPTT only over a time window of finite length
(an unfolded model of finite length)
2019-09-26 173Machine learning and artificial neural network
RNN: LSTM
 Long short-term memory (LSTM)
 A variant of the RNN (proposed in 1997) to solve (partly) the
vanishing gradient problem and to make the system memory longer.
 Vanilla RNN vs. LSTM
 3 gates (forget/input/output gate) + main path
 Two separate states: h(t) (short-term) and c(t) (long-term cell state)
2019-09-26 174Machine learning and artificial neural network
RNN: LSTM
 LSTM operation
 Gating functions:
f(t) = σ( U_f x(t) + V_f h(t−1) )   (forget gate)
i(t) = σ( U_i x(t) + V_i h(t−1) )   (input gate)
o(t) = σ( U_o x(t) + V_o h(t−1) )   (output gate)
 Cell state update:
c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ h̃(t)   (long-term memory)
h(t) = o(t) ⊙ tanh( c(t) )            (short-term memory)
where h̃(t) = f( U x(t) + V h(t−1) ) is the vanilla-RNN output
2019-09-26 175Machine learning and artificial neural network
RNN: LSTM
 LSTM operation
 Cell state update:
c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ h̃(t)   (long-term memory)
h(t) = o(t) ⊙ tanh( c(t) )            (short-term memory, final output)
 Ignoring the gating, c(t) is simply the sum of c(t−1) and the new
input h̃(t)  it can keep long-term memory
 f(t) selects important features from the previous state c(t−1),
which comprise one part of the current cell state c(t).
 i(t) selects important features from the new input h̃(t) (the output of
the vanilla RNN), which comprise the other part of the current cell state.
 o(t) controls what features in c(t) to pass to the output h(t).
2019-09-26 176Machine learning and artificial neural network
RNN: LSTM
 LSTM operation
 Gating functions:
f(t) = σ( U_f x(t) + V_f h(t−1) )   (forget gate)
i(t) = σ( U_i x(t) + V_i h(t−1) )   (input gate)
o(t) = σ( U_o x(t) + V_o h(t−1) )   (output gate)
 The parameters of the three gates (U_f, V_f, U_i, V_i, U_o, V_o) are
obtained through BPTT, too.
 i.e., the LSTM learns from the data what features to select
from c(t−1) (long-term memory) and from the new input h̃(t).
 It also learns what features in c(t) to pass to the final
output h(t). (A minimal NumPy sketch of one LSTM step follows this slide.)
2019-09-26 177Machine learning and artificial neural network
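A minimal NumPy sketch of a single LSTM step under the equations above; biases are omitted, and σ/tanh as the gate and candidate activations are stated here as assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p is a dict holding U_f, V_f, U_i, V_i, U_o, V_o, U, V."""
    f_t = sigmoid(p['Uf'] @ x_t + p['Vf'] @ h_prev)   # forget gate
    i_t = sigmoid(p['Ui'] @ x_t + p['Vi'] @ h_prev)   # input gate
    o_t = sigmoid(p['Uo'] @ x_t + p['Vo'] @ h_prev)   # output gate
    g_t = np.tanh(p['U'] @ x_t + p['V'] @ h_prev)     # vanilla-RNN candidate h~(t)
    c_t = f_t * c_prev + i_t * g_t                    # long-term memory c(t)
    h_t = o_t * np.tanh(c_t)                          # short-term memory / output h(t)
    return h_t, c_t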
RNN: Building RNN/LSTM model
 Unfolded RNN/LSTM model
 You can add NN layers on top of the RNN/LSTM cells
(a hedged Keras sketch follows the figure below).
2019-09-26 178Machine learning and artificial neural network
[Figure: unfolded model — RNN/LSTM cells 1, 2, …, K−1, K connected through delays (D), taking inputs x(t−K), …, x(t−1), x(t) and producing outputs y(t−K), …, y(t−1), y(t)]
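A minimal sketch (assuming TensorFlow/Keras is available); the window length K = 20, the single input feature, and the layer widths are illustrative assumptions for a sequence-regression setup.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

K, num_features = 20, 1                       # unfold over K time steps
model = Sequential([
    LSTM(32, input_shape=(K, num_features)),  # RNN/LSTM cell unfolded over the window
    Dense(16, activation='relu'),             # NN layer added on top of the cell
    Dense(1, activation='linear')             # linear top layer for regression
])
model.compile(optimizer='adam', loss='mse')
# model.fit(X, y, batch_size=32, epochs=20)   # X: (num_windows, K, num_features)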
Computer Lab.
 Practice 1: ML_practice4_RNN_seq_pred.ipynb
 Practice 2: ML_practice5_RNN_hihello.ipynb
Machine Learning and Neural Network
Ch.11: Convolutional neural network (CNN)
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.11 Convolutional neural network
 Topics
1. Features of CNN
2. CNN Model
 Convolution sublayer
 Activation function sublayer
 Pooling sublayer
3. CNN Training
2019-09-26 180Machine learning and artificial neural network
Roadmap
2019-09-26 181Machine learning and artificial neural network
Ch.4 Linear Regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural Network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
CNN: Convolutional neural network
 Image/vision classification and object detection
 An image has a 2D (matrix) or 3D (tensor) structure (e.g., RGB)
 Information is contained in the pixels, the elements of a matrix
(2D image) or a tensor (2D images for RGB, or 2D images
captured with 2 cameras).
 Nearby pixel values are highly correlated
 patterns in an image can be identified by the correlations
between nearby pixels
 nearby pixels must be processed as a chunk
 Identifying patterns in an image is “translation invariant”
and “size invariant” (we can identify the same pattern
wherever it is located and whatever its size is).
 Sometimes, it should also be rotation invariant.
2019-09-26 182Machine learning and artificial neural network
CNN: Convolutional neural network
 CNN for image/vision data
 CNN is a special NN designed for image/vision data.
 Can be used for image classification, object detection,
depth estimation, etc.
 It processes a chunk of nearby pixels simultaneously
(receptive field)
 Will see how it provides object (pattern) detection with
translation invariance.
 Size invariance can be provided by multi-layer structure
 Rotation invariance?
2019-09-26 183Machine learning and artificial neural network
CNN Model
 (Example) configuration of a CNN
 Two convolution NN layers and 3 fully connected (FC) NN layers.
 Convolution NN layers are divided into sublayers: a
convolution sublayer (denoted by CX) and a pooling sublayer
(denoted by SX)
 The FC NN layers are C5, F6 and the output (C5 acts like an FC NN layer)
 A hedged Keras sketch of a similar configuration follows the source note below.
2019-09-26 184Machine learning and artificial neural network
Source: Proc. of the IEEE, Nov. 1998, Y. LeCun et al.
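A minimal sketch (assuming TensorFlow/Keras is available) of a LeNet-like configuration: two convolution layers (each a convolution + activation + pooling sublayer) followed by FC layers. ReLU and max pooling are modern substitutions, and the 32×32 grayscale input and 10-class output are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(6, (5, 5), activation='relu', input_shape=(32, 32, 1)),  # C1: 6 output channels
    MaxPooling2D((2, 2)),                                           # S2: pooling, r = 2
    Conv2D(16, (5, 5), activation='relu'),                          # C3: 16 output channels
    MaxPooling2D((2, 2)),                                           # S4
    Flatten(),
    Dense(120, activation='relu'),                                  # C5 (FC-like)
    Dense(84, activation='relu'),                                   # F6
    Dense(10, activation='softmax')                                 # output
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])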
CNN Model – Convolution layer
 CNN model (convolution layer)
 For convenience, we divide it into 3 sublayers.
 Convolution sublayer
 Activation function sublayer
 Pooling sublayer
 Activation function sublayer is the same as in conventional NN
 Dropout can also be applied as in fully connected NN layer
2019-09-26 185Machine learning and artificial neural network
CNN Model – Convolution layer
 CNN model (convolution layer)
 A conventional NN layer has a 1-dimensional array of neurons,
while a CNN layer has a 3-dimensional array (width, height and
depth), where the depth index is called the “channel”
 The input to a CNN layer is also 3-dimensional, e.g., 2D images
with RGB (3 channels)
 Denote the 3-d input and output of the CNN layer as x_i and
h_j, where i and j are the input and output channel indices.
2019-09-26 186Machine learning and artificial neural network
CNN Model – Convolution layer
 CNN model (convolution layer)
 The operations of the three sublayers are
 Convolution sublayer ------------: g_j = Σ_i W_ji ∗ x_i
 Activation function sublayer ---: a_j = f( g_j )
 Pooling sublayer -----------------: h_j = pool( a_j )
 The input and output sizes are the same only for the AF sublayer.
The other two sublayers have different input and output sizes.
2019-09-26 187Machine learning and artificial neural network
CNN Model – Convolution layer
 Convolution sublayer
 g_j = Σ_i W_ji ∗ x_i
 W_ji is the weight matrix (filter) between the i-th channel of the
input and the j-th channel of the output.
 ∗ is 2-d convolution, with which the (m,n)-th element of g_j is
given by g_j[m,n] = Σ_i Σ_{(p,q)∈R(m,n)} W_ji[p,q] x_i[p,q]
 R(m,n) is the “receptive field” of the (m,n)-th neuron
 (A minimal NumPy sketch of this operation follows the figure below.)
2019-09-26 188Machine learning and artificial neural network
[Figure: the 2-d array of input signal of the i-th input channel is mapped, through a shared filter sliding over receptive fields, to the 2-d array of neurons of the j-th output channel]
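A minimal NumPy sketch of the single-channel case (one input channel, one filter); the multi-channel sublayer sums this over input channels for each output channel. The stride handling and the toy averaging filter are illustrative assumptions.

import numpy as np

def conv2d_single(x, w, stride=1):
    """Correlate one input channel x (H x W) with one filter w (k x k), 'valid' range."""
    H, W = x.shape
    k = w.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    y = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            patch = x[m*stride:m*stride+k, n*stride:n*stride+k]  # receptive field R(m,n)
            y[m, n] = np.sum(patch * w)                          # shared filter weights
    return y

x = np.arange(36.0).reshape(6, 6)
w = np.ones((3, 3)) / 9.0            # a simple 3x3 averaging filter as a toy example
print(conv2d_single(x, w).shape)     # (4, 4)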
CNN Model – Convolution layer
 Convolution sublayer
 Each filter responds to a certain pattern within a
receptive field on the input.
 Filter examples: three filters of size 5x5 responding to
different patterns (diamond, T and diagonal, respectively)
 The filter coefficients are obtained through CNN training
and, in general, they are real-valued.
2019-09-26 189Machine learning and artificial neural network
CNN Model – Convolution layer
 Convolution sublayer example (2 input ch., 3 output ch.)
 All the neurons of a channel share the same weight matrices
 A channel (2D array) is a feature map containing information
of a (combination of) specific pattern(s) defined by weight
matrices; (information on location and existence)
2019-09-26 190Machine learning and artificial neural network
CNN Model – Convolution layer
 Convolution sublayer
 Configuration parameters
• the stride and the size of the 2-d weight matrix (filter)
• the sizes of the 3-d input and output (width × height × channels)
 The stride, filter size and input/output sizes must be set consistently
 The number of weight matrices (filters) to train is
(# of input channels) × (# of output channels)
 In general, the output width and height are no larger than the input’s,
while the number of channels typically grows with depth
2019-09-26 191Machine learning and artificial neural network
CNN Model – Convolution layer
 Activation function sublayer
 a_j = f( g_j )
 The output of the convolution sublayer, g_j, is passed through
an activation function.
 ReLU or leaky ReLU is typically used.
 The output a_j has the same size as the input.
2019-09-26 192Machine learning and artificial neural network
CNN Model – Convolution layer
 Pooling sublayer
 h_j = pool( a_j )
 The pooling sublayer down-samples the sublayer input a_j.
 While doing so, it also summarizes the data.
 Let r be the down-sample ratio. Each channel of the input is
partitioned into r×r areas (pooling areas), in which each r×r
array of numbers is summarized into a scalar.
 Two types: max-pooling (takes the maximum value) and average-
pooling (takes the average of the values)  the output size is
1/r² of the input size
2019-09-26 193Machine learning and artificial neural network
CNN Model – Convolution layer
 Pooling sublayer
 The pooling operation can be expressed as
h_j[m,n] = max_{(p,q)∈P(m,n)} a_j[p,q]   (max pooling), or
h_j[m,n] = (1/r²) Σ_{(p,q)∈P(m,n)} a_j[p,q]   (average pooling)
 P(m,n) is the pooling area of the (m,n)-th output.
 Pooling reduces the computational burden, e.g., with r = 2, the
number of parameters to train downstream is reduced to ¼.
 If r is too large, however, important information can be lost.
 It is better to apply pooling multiple times with a small r
(a minimal NumPy sketch follows this slide).
2019-09-26 194Machine learning and artificial neural network
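A minimal NumPy sketch of max pooling on a single channel, assuming the channel height and width are multiples of the ratio r.

import numpy as np

def max_pool(x, r=2):
    """Down-sample one channel by r using max pooling over non-overlapping r x r areas."""
    H, W = x.shape
    return x.reshape(H // r, r, W // r, r).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x, 2))   # 2x2 output: the maximum of each 2x2 pooling area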
CNN training
 CNN training
 The parameters to optimize are the weight matrices W_ji of every
convolution sublayer
 As in a conventional NN, we apply the chain rule to compute the
gradient w.r.t. each W_ji
 Differences from a conventional NN
1. 3-D (cubic) arrays of neurons
2. partial connection & weight sharing in the conv. sublayer
3. passing the gradient through the pooling sublayer
 See textbook section 11.3 for details
2019-09-26 195Machine learning and artificial neural network
CNN training
 Improving performance of CNN
 Apply dropout to avoid co-adaptation between channels
 Data normalization: adjust mean (brightness) and variance
(contrast) of image to make them fall within predefined ranges
 Batch normalization: normalize data for each batch at each layer
 Data augmentation: increase data set by resizing and/or
rotating the original image  size/rotation invariance
2019-09-26 196Machine learning and artificial neural network
 Practice: ML_practice6_CNN_190820.ipynb
2019-09-26 197Machine learning and artificial neural network
Computer Lab.
Machine Learning and Neural Network
Ch.12/13: Unsupervised learning:
Clustering and data visualization
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.12/13 Clustering and data visualization
 Topics
1. Clustering
 Partitioning (centroid) based clustering: k-means algorithm
 Hierarchical (connectivity based) clustering and dendrogram
 Density based clustering
 Distribution based clustering
2. EM algorithm for Gaussian Mixture Model (Ch.13)
3. Data visualization using non-linear mapping: t-SNE
2019-09-26 199Machine learning and artificial neural network
Clustering and data visualization
 Clustering
 Data without labels: X = {x_i}, i = 1, …, N
 The objective is to divide the data into a set of groups
based on some similarity measure
 Need to devise procedures to efficiently group the data
 Data (distribution) visualization to check clusters
 Typical similarity measures:
 Euclidean distance: d(x_i, x_j) = ‖x_i − x_j‖
 Correlation: ρ(x_i, x_j) = x_iᵀ x_j / (‖x_i‖ ‖x_j‖)
2019-09-26 200Machine learning and artificial neural network
Clustering and data visualization
 Four approaches to clustering
 Partitioning (centroid) based clustering: k-means
 Hierarchical (connectivity based) clustering
 Density based clustering
 Distribution based clustering: Gaussian Mixture Model and
EM algorithm (ch.13)
2019-09-26 201Machine learning and artificial neural network
Partitioning (centroid) based Clustering: k-means
 Partitioning (centroid) based clustering
2019-09-26 202Machine learning and artificial neural network
 The feature space is
partitioned into Voronoi
regions, where each region
is represented by a
centroid.
 Based on the Euclidean distance
measure, the points in a
Voronoi region are those
closest to that centroid
 The k-means (Lloyd) algorithm
searches for the centroids of a
pre-defined number of
regions to partition into.
Partitioning (centroid) based Clustering: k-means
 K-means clustering (Lloyd’s algorithm)
 Input: data {x_i}, i = 1, …, N; K: the number of clusters to find
 Initialization: randomly select K samples and use them as the
centroids c_1, …, c_K
1) Determine the class members S_k:
 Set S_k = ∅ for all k
 For all samples x_i, do
 k* = argmin_{k ∈ {1,…,K}} ‖x_i − c_k‖, then add x_i to S_{k*}
2) Update the centroids:
c_k = (1/|S_k|) Σ_{x ∈ S_k} x (mean of its members)
 Repeat 1) and 2) until the assignment doesn’t change any more
 Output: the centroids c_k and a cluster label for every x_i
(a minimal NumPy sketch follows this slide)
2019-09-26 203Machine learning and artificial neural network
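A minimal NumPy sketch of Lloyd's algorithm above; the toy two-cluster data and the random initialization are illustrative assumptions.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # K random samples
    labels = None
    for _ in range(n_iter):
        # 1) assign each sample to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # assignment no longer changes
        labels = new_labels
        # 2) move each centroid to the mean of its members
        for k in range(K):
            if np.any(labels == k):                 # keep the old centroid if a cluster is empty
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, K=2)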
Partitioning (centroid) based Clustering: k-means
 Partitioning (centroid) based clustering
 The k-means algorithm was originally proposed for vector
quantization
 The clusters found can be quite different from our
expectation, especially when the sizes of the true clusters
are quite different
2019-09-26 204Machine learning and artificial neural network
Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
Hierarchical (connectivity based) clustering
 Hierarchical clustering
 The cluster hierarchy is represented by a dendrogram (a binary
tree representing the similarity between clusters).
 In the tree, a node is a cluster and a leaf node is a sample
 Two approaches to build the dendrogram: top-down (divisive) or
bottom-up (agglomerative)
2019-09-26 205Machine learning and artificial neural network
[Figure: a dendrogram with root, internal and leaf nodes over a heatmap; columns are samples labelled by (BRCA) tumor category, rows are features (gene names)]
Hierarchical (connectivity based) clustering
 Bottom-up (agglomerative) approach
 Initially, each sample is set as a cluster (leaf node) having only
one member.
1) Compute “inter-cluster distances” for every pair of clusters
(nodes without parent).
2) Select the pair with smallest distance and merge them to one.
(add a node in the tree connecting the two nodes)
 Repeat 1) and 2) until only one cluster is left
2019-09-26 206Machine learning and artificial neural network
Source: https://www.researchgate.net/publication/273456906_Cluster_Analysis_to_Understand_Socio-Ecological_Systems_A_Guideline/figures?lo=1
Hierarchical (connectivity based) clustering
 Bottom-up (agglomerative) approach
2019-09-26 207Machine learning and artificial neural network
 Inter-cluster distance:
the distance between two clusters
 It can be defined as the
 minimum (single linkage)
 average (average linkage)
 maximum (complete linkage)
 of the distances between every pair
of members (one from each
cluster); a SciPy sketch follows this slide
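A minimal sketch (assuming SciPy and Matplotlib are available) of agglomerative clustering with a chosen linkage and a dendrogram plot; the toy data are assumptions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # toy data

Z = linkage(X, method='average')                  # 'single', 'average' or 'complete'
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
dendrogram(Z)
plt.show()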
Hierarchical (connectivity based) clustering
 Top-down (divisive) approach
 Initially, we have only one cluster (the root node) having all
the samples as its members.
1) Select the cluster having the highest “intra-cluster distance”
(for example)
2) Apply k-means clustering to divide it into two.
 Repeat 1) and 2) until every cluster has only one member.
 Another name for this is “hierarchical k-means”
2019-09-26 208Machine learning and artificial neural network
Density based clustering
 Density based clustering
 A cluster is defined as a set of samples that lie within a
relatively dense area.
 Clusters are separated by sparse areas.
 Useful when clusters are not centralized (not radially
distributed)
 Two well-known algorithms: DBSCAN and OPTICS
2019-09-26 209Machine learning and artificial neural network
Source: https://untitledtblog.tistory.com/146
Density based clustering
 Density based clustering: DBSCAN
 Two parameters: ε (distance threshold) and minPts (# of points)
 Definition (core point): a point from which there are at
least minPts points within a distance ε.
 First, divide all the points into core and non-core points.
 Assign cluster #s to core points
1) Select a core point x whose cluster is not assigned yet.
2) Find all the core points that can be connected to each other
within a distance ε  assign a cluster # to these core point(s)
3) Repeat 1) and 2) to find all the core-point clusters
 Assign cluster #s to non-core points
1) For each non-core point, find the closest core point within the
distance ε and set its cluster to the cluster # of that core point.
2) If there is no core point within ε, it is simply regarded as an outlier.
(a scikit-learn sketch follows this slide)
2019-09-26 210Machine learning and artificial neural network
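A minimal sketch (assuming scikit-learn is available); eps and min_samples correspond to the two parameters above, and the toy data are assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)   # a cluster # per sample; -1 marks outliers (no core point within eps)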
Distribution based clustering
 Distribution based clustering: Mixture model
 Use a PDF model (with parameters) to approximate the
probability distribution of each cluster
 The data distribution is modelled by a mixture of the PDFs
 A well-known, mathematically tractable one is the Gaussian
mixture model (GMM), in which the data distribution is
modelled by
p(x) = Σ_k π_k N(x; μ_k, C_k),
where k is the cluster index and K is the number of clusters
 The objective is to find the optimal model parameters π_k, μ_k, C_k
for k = 1, …, K that best fit the given data set.
2019-09-26 211Machine learning and artificial neural network
Gaussian mixture model & EM algorithm
 Gaussian mixture model
p(x) = Σ_k p(z = k) p(x | z = k) = Σ_k π_k N(x; μ_k, C_k),
where k is the cluster index, K is the number of clusters, and
z is a latent (hidden) variable indicating the cluster
 The objective is to find the optimal model parameters π_k, μ_k, C_k
for k = 1, …, K that best fit the given data set.
 Issues
 We may use the likelihood as the objective function:
L(θ) = Π_i p(x_i | θ)
(θ consists of the {π_k, μ_k, C_k}’s)
 It is not easy to maximize, as p(x | θ) contains a summation over
the latent variable
2019-09-26 212Machine learning and artificial neural network
Gaussian mixture model & EM algorithm
 EM algorithm (in general)
 Use the conditional likelihood given z, i.e., assume z
(the cluster of each sample x) is fixed
 Define the expected complete-data log-likelihood
Q(θ | θ(t)) = E_{z | X, θ(t)} [ log p(X, z | θ) ]
 With this, we iteratively find Q and θ
 Steps
 Initialize θ(0) and do the following until convergence
1) E-step: compute Q(θ | θ(t)) = E_{z | X, θ(t)} [ log p(X, z | θ) ]
2) M-step: θ(t+1) = argmax_θ Q(θ | θ(t))
2019-09-26 213Machine learning and artificial neural network
Gaussian mixture model & EM algorithm
 EM algorithm for Gaussian mixture model
 Conditional likelihood: p(x | z = k, θ) = N(x; μ_k, C_k)
 Steps
 Input: data {x_i}, i = 1, …, N; K: the number of clusters
 Initialize θ(0) = {π_k(0), μ_k(0), C_k(0)}
 Do the following until convergence
1) E-step (responsibilities):
γ_ik(t) = π_k(t) N(x_i; μ_k(t), C_k(t)) / Σ_j π_j(t) N(x_i; μ_j(t), C_j(t))
2) M-step:
π_k(t+1) = Σ_i γ_ik(t) / Σ_j Σ_i γ_ij(t)
μ_k(t+1) = Σ_i γ_ik(t) x_i / Σ_i γ_ik(t)
C_k(t+1) = Σ_i γ_ik(t) (x_i − μ_k(t+1))(x_i − μ_k(t+1))ᵀ / Σ_i γ_ik(t)
(See textbook section 13.2 for details; a scikit-learn sketch follows this slide)
2019-09-26 214Machine learning and artificial neural network
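A minimal sketch (assuming scikit-learn is available) of EM fitting of a GMM; n_components = K must be fixed a priori, as noted on the next slide, and the toy data are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
print(gmm.weights_)           # mixing coefficients pi_k
print(gmm.means_)             # cluster means mu_k
labels = gmm.predict(X)       # hard cluster assignment
resp = gmm.predict_proba(X)   # E-step responsibilities gamma_ik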
Gaussian mixture model & EM algorithm
 Clustering with GMM
 Note
 The number of clusters K must be fixed a priori.
 Variational EM can find a good value for K implicitly.
 See “C. M. Bishop, Pattern Recognition and Machine
Learning, Springer” for variational EM
2019-09-26 215Machine learning and artificial neural network
Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
Dimension reduction
and data visualization
Non-linear feature dimension reduction: t-SNE
 Data (distribution) visualization
 Data visualization gives us a lot of information about the data:
the shape of its distribution, the number of separable
clusters, and so on.
 One can also check whether clustering was done properly and
whether there are any outliers.
 Linear dimension reduction (PCA) is effective if the number
of clusters or the original feature dimension is small
enough.
 We discuss a non-linear dimension reduction technique,
t-distributed stochastic neighbor embedding (t-SNE).
2019-09-26 217Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Requirement in general
 points close to each other in the original space must also
be close together in the new (low dimensional) space.
 The local structure (manifolds) in the original space is kept
in the new space with as little distortion as possible.
 Characteristics of t-SNE
 It’s a non-linear mapping
 Direct mapping: x in the original space  z in the new space,
obtained by solving an optimization problem
 If some new data is added, we need to perform
optimization again and the new mapping will be different
from the previous one.
 An upgraded version of SNE
2019-09-26 218Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Elements
 Pairwise similarity in the original space: p_ij
 Pairwise similarity in the new space: q_ij
 Cost function: the mismatch between {p_ij} and {q_ij}
 Definition
 Given data points x_1, …, x_N, let z_i be the point-wise
mapping of x_i in the new space.
 p_ij = exp(−‖x_i − x_j‖²/2σ²) / Σ_{k≠l} exp(−‖x_k − x_l‖²/2σ²)
 q_ij = (1 + ‖z_i − z_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖z_k − z_l‖²)⁻¹
(a Gaussian kernel in the original space, a Student-t kernel in the new space)
 Both {p_ij} and {q_ij} are valid PMFs.
2019-09-26 219Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Cost function: Kullback–Leibler divergence (KLD)
 Cost = KLD between {p_ij} and {q_ij}:
C(Z) = KL(P‖Q) = Σ_{i,j: i≠j} p_ij log( p_ij / q_ij )
 C(Z) ≥ 0, with equality iff p_ij = q_ij, holds if {p_ij} and {q_ij} are valid PMFs
 Optimization
 We want to find Z = {z_i} that minimizes C(Z).
 Apply gradient descent, for which the gradient of C(Z)
w.r.t. z_i is given by
∂C(Z)/∂z_i = 4 Σ_j (p_ij − q_ij) (1 + ‖z_i − z_j‖²)⁻¹ (z_i − z_j)
 More tricks were applied (see the original paper)
2019-09-26 220Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Note
∂C(Z)/∂z_i = 4 Σ_j (p_ij − q_ij) (1 + ‖z_i − z_j‖²)⁻¹ (z_i − z_j)
 Let X be the original space and Z be the new space
 The direction of movement of z_i contributed by each j is either
toward z_j or the opposite
 The sign is determined by (p_ij − q_ij), i.e., the movement is toward z_j
if p_ij > q_ij (similarity in Z < that in X, or distance in Z > that in X)
 The actual movement is given by the sum over all j  it makes
{q_ij} and {p_ij} as close as possible
 (1 + ‖z_i − z_j‖²)⁻¹ can be regarded as the rate of movement
 The rate of movement is large if z_i and z_j are close together,
and vice versa  it tries to keep the focus on the local structure
2019-09-26 221Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Comparison: PCA versus t-SNE
 400 dimensional features mapped to 2-dimensional features
2019-09-26 222Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Perplexity: setting σ_i
 Perplexity is defined for a point x_i as Perp(p_i) = 2^{H(p_i)},
where H(p_i) = −Σ_j p_{j|i} log₂ p_{j|i}
with p_{j|i} = exp(−‖x_i − x_j‖²/2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖²/2σ_i²)
 We make the perplexity roughly the same for every point, i.e.,
 set σ_i smaller in dense regions (many points nearby)
 set σ_i larger in sparse regions (few points nearby)
 In this way, the effective number of points nearby is
made roughly the same
 Binary search can be used to find σ_i
 A typical value of the perplexity is 5–50
(a scikit-learn sketch follows this slide)
2019-09-26 223Machine learning and artificial neural network
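A minimal sketch (assuming scikit-learn is available); the 50-dimensional toy features and the perplexity value are illustrative assumptions.

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                       # toy high-dimensional features

Z = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(Z.shape)                                       # (200, 2) embedding for plotting
# Note: adding new data requires rerunning the optimization (no direct out-of-sample mapping).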
 Practice: ML_practice7_clustering.ipynb
2019-09-26 224Machine learning and artificial neural network
Computer Lab.

More Related Content

What's hot

Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.
Rohit Kumar
 
Introduction to AI & ML
Introduction to AI & MLIntroduction to AI & ML
Introduction to AI & ML
Mandy Sidana
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Mohammed Bennamoun
 
Perceptron
PerceptronPerceptron
Perceptron
Nagarajan
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
Koundinya Desiraju
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
MachinePulse
 
Deep learning ppt
Deep learning pptDeep learning ppt
Deep learning ppt
BalneSridevi
 
Perceptron & Neural Networks
Perceptron & Neural NetworksPerceptron & Neural Networks
Perceptron & Neural Networks
NAGUR SHAREEF SHAIK
 
Machine Learning by Rj
Machine Learning by RjMachine Learning by Rj
Machine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesMachine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesRui Pedro Paiva
 
Machine learning
Machine learningMachine learning
Machine learning
Dr Geetha Mohan
 
Lecture 9 Perceptron
Lecture 9 PerceptronLecture 9 Perceptron
Lecture 9 Perceptron
Marina Santini
 
Fundamental, An Introduction to Neural Networks
Fundamental, An Introduction to Neural NetworksFundamental, An Introduction to Neural Networks
Fundamental, An Introduction to Neural Networks
Nelson Piedra
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5
Sunwoo Kim
 
Neural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's PerceptronNeural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's Perceptron
Mostafa G. M. Mostafa
 
Summer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonSummer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of London
Yash Khanna
 
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
Edureka!
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
PavanpreetKaur1
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Simplilearn
 
Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.
ASHOK KUMAR
 

What's hot (20)

Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.
 
Introduction to AI & ML
Introduction to AI & MLIntroduction to AI & ML
Introduction to AI & ML
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
 
Perceptron
PerceptronPerceptron
Perceptron
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
Deep learning ppt
Deep learning pptDeep learning ppt
Deep learning ppt
 
Perceptron & Neural Networks
Perceptron & Neural NetworksPerceptron & Neural Networks
Perceptron & Neural Networks
 
Machine Learning by Rj
Machine Learning by RjMachine Learning by Rj
Machine Learning by Rj
 
Machine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesMachine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and Techniques
 
Machine learning
Machine learningMachine learning
Machine learning
 
Lecture 9 Perceptron
Lecture 9 PerceptronLecture 9 Perceptron
Lecture 9 Perceptron
 
Fundamental, An Introduction to Neural Networks
Fundamental, An Introduction to Neural NetworksFundamental, An Introduction to Neural Networks
Fundamental, An Introduction to Neural Networks
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5
 
Neural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's PerceptronNeural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's Perceptron
 
Summer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonSummer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of London
 
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
 
Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.
 

Similar to Machine learning and_neural_network_lecture_slide_ece_dku

Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
Dr. Abdul Ahad Abro
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
IRJET Journal
 
SVM & MLP on Matlab program
 SVM & MLP on Matlab program  SVM & MLP on Matlab program
SVM & MLP on Matlab program
Hussain Ala'a Alkabi
 
# Neural network toolbox
# Neural network toolbox # Neural network toolbox
# Neural network toolbox
VineetKumar508
 
Rachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra
 
Machine learning
 Machine learning Machine learning
Machine learning
Siddharth Kar
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
IRJET Journal
 
Artificial Intelligence based Pattern Recognition
Artificial Intelligence based Pattern RecognitionArtificial Intelligence based Pattern Recognition
Artificial Intelligence based Pattern Recognition
Dr. Amarjeet Singh
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
NitinSharma134320
 
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET Journal
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
VenkateswaraBabuRavi
 
Visualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine LearningVisualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine Learning
IRJET Journal
 
Data Science Machine
Data Science Machine Data Science Machine
Data Science Machine
Luis Taveras EMBA, MS
 
IRJET- Machine Learning
IRJET- Machine LearningIRJET- Machine Learning
IRJET- Machine Learning
IRJET Journal
 
Computational model for artificial learning using formal concept analysis
Computational model for artificial learning using formal concept analysisComputational model for artificial learning using formal concept analysis
Computational model for artificial learning using formal concept analysisAboul Ella Hassanien
 
Machine_Learning_Co__
Machine_Learning_Co__Machine_Learning_Co__
Machine_Learning_Co__
Sitamarhi Institute of Technology
 
Survey on Artificial Neural Network Learning Technique Algorithms
Survey on Artificial Neural Network Learning Technique AlgorithmsSurvey on Artificial Neural Network Learning Technique Algorithms
Survey on Artificial Neural Network Learning Technique Algorithms
IRJET Journal
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
IAEME Publication
 
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET Journal
 

Similar to Machine learning and_neural_network_lecture_slide_ece_dku (20)

Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
SVM & MLP on Matlab program
 SVM & MLP on Matlab program  SVM & MLP on Matlab program
SVM & MLP on Matlab program
 
# Neural network toolbox
# Neural network toolbox # Neural network toolbox
# Neural network toolbox
 
Rachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_report
 
Machine learning
 Machine learning Machine learning
Machine learning
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
 
Artificial Intelligence based Pattern Recognition
Artificial Intelligence based Pattern RecognitionArtificial Intelligence based Pattern Recognition
Artificial Intelligence based Pattern Recognition
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
 
Lecture-6-7.pptx
Lecture-6-7.pptxLecture-6-7.pptx
Lecture-6-7.pptx
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Visualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine LearningVisualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine Learning
 
Data Science Machine
Data Science Machine Data Science Machine
Data Science Machine
 
IRJET- Machine Learning
IRJET- Machine LearningIRJET- Machine Learning
IRJET- Machine Learning
 
Computational model for artificial learning using formal concept analysis
Computational model for artificial learning using formal concept analysisComputational model for artificial learning using formal concept analysis
Computational model for artificial learning using formal concept analysis
 
Machine_Learning_Co__
Machine_Learning_Co__Machine_Learning_Co__
Machine_Learning_Co__
 
Survey on Artificial Neural Network Learning Technique Algorithms
Survey on Artificial Neural Network Learning Technique AlgorithmsSurvey on Artificial Neural Network Learning Technique Algorithms
Survey on Artificial Neural Network Learning Technique Algorithms
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
 
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
 

Recently uploaded

Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
veerababupersonal22
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 

Recently uploaded (20)

Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 

Machine learning and_neural_network_lecture_slide_ece_dku

  • 1. Machine Learning and Neural Network Course introduction Seokhyun Yoon, Electronics Eng., Dankook Uinversity
  • 2. Machine learning: Course introduction  Target audience  Senior undergraduate  First year graduate student  Prerequisite  Linear algebra (선형대수 혹은 공학수학2)  Basic probability and statistics (확률및통계학)  Basic Python programming  Textbook  기계학습과 인공신경망 개론 (Ver1.xx)  Download: https://www.slideshare.net/SeokhyunYoon1/ 2019-09-26 2Machine learning and artificial neural network
  • 3. Machine Learning and Neural Network Ch.1: Introduction Seokhyun Yoon, Electronics Eng., Dankook Uinversity
  • 4. Ch.1 ML Introduction  Objective: get “feel” and terminologies 1. What is machine learning? Concept and applications 2. What problems can ML solve?  Classification, regression and clustering  Supervised and unsupervised learning 3. Key elements of ML  Data, Model and Cost 4. Design steps and issues in performance evaluation 2019-09-26 4Machine learning and artificial neural network
  • 5. Machine learning: Introduction  Major applications  Pattern classification: Character/Speech recognition  Object detection and tracking  Time-series prediction (Stock price/market prediction, weather forecast)  Sentence completion and language translation  … and much more  Problems in machine learning  Classification  Regression  Clustering 2019-09-26 5Machine learning and artificial neural network
  • 6. Machine learning: Introduction Related fields 2019-09-26 6Machine learning and artificial neural network Machine learning Probability and statistics Data science Cognitive science Artificial Intelligence Computer science Big data Data mining Linguistics Psychology, neuro-science Neural Network
  • 7. Machine learning: Introduction  Elements of machine learning in classification and regression problem  Prediction model ( ) with parameters ( )  Data (observations and their target values )  Cost(loss)/Objective function to minimize/maximize  Algorithm to efficiently obtain the optimal or a good solution 2019-09-26 7Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: Cost/loss: Algorithm to solve ∗ 𝜽
  • 8. Machine learning: Introduction  Machine learning process 2019-09-26 8Machine learning and artificial neural network Existing data 𝒊 𝟏 𝑵 Machine learning Algorithm ∗ 𝜽 Model (with parameters) ∗ New data Prediction ∗
  • 9. Machine learning: Introduction  Classification and regression 2019-09-26 9Machine learning and artificial neural network cat (Smiling) X y 1 2 3 4 5 6 1 2 3 X = 2.5, y=? X = 6.0, y=? Existing dataNew data
  • 10. Machine learning: Introduction  Given an observation (which can be a vector, a matrix (image) or a tensor)  Classification determines its class among a set of classes  Regression estimates/predicts unobserved variables  Regression can be a prediction of future trend or interpolation of some missing information  Classification vs. regression  In classification, is a discrete, categorical value drawn from a finite set  In regression, is a numerical value 2019-09-26 10Machine learning and artificial neural network
  • 11. Machine learning: Introduction  Machine learning is all about to find and  How to find the best or, at least, a good ?  Given , how to find the best or, at least, a good ?  The best or a good for what and in what sense ?  Why do we need pre-collected data for learning/training ? 2019-09-26 11Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: Cost/loss: Algorithm to solve ∗ 𝜽
  • 12. Machine learning: Introduction  Some terminologies  Learning/Training/Model fitting: process to find the model parameters ( ) that best fit to given data in terms of the predefined cost/objective  Supervised learning: target values ( ) are provided • Classification, regression  Unsupervised learning: no target values provided • Clustering 2019-09-26 12Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: Cost/loss: Algorithm to solve ∗ 𝜽
  • 13. Machine learning: Introduction  Design steps (supervised learning) 1. Define the function you want to implement (define input and output ) 2. Design your model , intuitively and smartly 3. Collect data and curate them to set 4. Train the model to get ∗ 5. Use ∗ to evaluate the performance 6. If satisfied, you are done! Otherwise, go to step 2 (skip 3).  Step 2 requires strong/some mathematical background  Step 3 is typically time-consuming and sometimes requires domain expertise (e.g. for medical application) 2019-09-26 13Machine learning and artificial neural network
  • 14. Machine learning: Introduction  Design steps for beginner (supervised learning) 1. Choose a function you want to implement (input/output formats are pre-defined) 2. Search for some open SW packages to choose/construct an appropriate model and try to modify slightly 3. Download dataset ( , ) from the internet 4. Use the packages to train the model to get ∗ 5. Use ∗ to evaluate the performance 6. If satisfied, you are done! Otherwise, go to step 2 (skip 3). 2019-09-26 14Machine learning and artificial neural network
  • 15. Machine learning: Introduction  Parameters and hyper parameters  Most of the models have some hyper-parameters that are pre-defined before training  Must be optimized for performance, computing costs …  may need grid search to find the best combination of hyper parameters. 2019-09-26 15Machine learning and artificial neural network
  • 16. Machine learning: Introduction  Performance evaluation of classifier/regressor  Must consider “generalization error”  Typical performance measures  Classification: Accuracy  Regression: Mean Squared Error , R2 measure 2019-09-26 16Machine learning and artificial neural network 그림 1.1 분류기/추정기의 학습과 테스트
  • 17. Machine learning: Introduction  Clustering  No target values for observations  Objective is to divide data into a set of groups based on some similarity measures  Need to devise procedures to efficiently group data  Data (distribution) visualization may help  Once clustered, the data can be used for classification 2019-09-26 17Machine learning and artificial neural network
  • 18. Machine learning: Introduction  Two typical similarity measures  Euclidian distance:  Correlation: 𝒙 𝒙 𝒙 𝒙  Need to consider symmetricity and their ranges  Note  L-p norm of a vector: /  Default value of p = 2   Schwartz’s inequality: 2019-09-26 18Machine learning and artificial neural network
  • 19. Machine learning: Introduction  Simplest classifier: k nearest neighbor (knn) classifier  Training data 𝒊 𝟏 𝑵 used as templates  Given new input data , it determines its class as follows 1. Compute (may use other similarity measure) 2. Select k candidates nearest to 3. Use majority vote to determine the class of 2019-09-26 19Machine learning and artificial neural network Existing data 𝒊 𝟏 𝑵 knn classifierNew data Prediction
  • 20. Machine learning: Introduction  k nearest neighbor (knn) as regressor  Training data 𝒊 𝟏 𝑵 used as templates  Given new input data , it determines its class as follows 1. Compute (may use other similarity measure) 2. Select k candidates nearest to 3. Take average of k candidates to determine the estimates 2019-09-26 20Machine learning and artificial neural network Existing data 𝒊 𝟏 𝑵 knn regressorNew data Prediction
  • 21. Machine Learning and Neural Network Ch.2: Data and descriptive statistics Seokhyun Yoon, Electronics Eng., Dankook Uinversity
  • 22. Ch.2 Data and descriptive statistics  Topics 1. Data: types and representation 2. Descriptive statistics  Scatter plot and histogram  Mean, correlation and covariance 2019-09-26 22Machine learning and artificial neural network
  • 23. Data and descriptive stat.  Terminologies and notation  Observation/sample/feature vector (for now, assume that it is a vector).  Target value : desired value for a sample  In supervised learning, and should be paired ( , )  Collection of data: 2019-09-26 23Machine learning and artificial neural network Each column is a sample each row is a feature
  • 24. Data and descriptive stat.  Two types of data:  Categorical  Numerical  Categorical value is typically mapped to an integer to make it suitable for computation  ex: T → 1, F → 0  Blood type: O → 0, A → 1, B → 2, AB → 3 2019-09-26 24Machine learning and artificial neural network
  • 25. Data and descriptive stat.  An example of multivariate (다변량) data  Data consisting of 20 samples  Each column is one sample with 4 features, (Group, English, Math, Science score)  call it feature vector  where Group is categorical and others are numerical 2019-09-26 25Machine learning and artificial neural network sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD. Group A A A A A A A A A A B B B B B B B B B B English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09 Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09 Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47 a sample/observation
  • 26. Data and descriptive stat.  Example problems  Classification: Given , determine  Regression: Given , estimate 2019-09-26 26Machine learning and artificial neural network sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD. Group A A A A A A A A A A B B B B B B B B B B English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09 Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09 Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47 a sample/observation
  • 27. Data and descriptive stat.  Data visualization: scatter plot and histogram 2019-09-26 27Machine learning and artificial neural network  Empirical (probability) density gives us lots of information for the design and performance of classifiers, regressors and clustering algorithms  One- or two-dimensional (bivariate) data is easy to visualize, while data in more than two dimensions is hard  Pairwise scatter plots are affordable for small M
  • 28. Data and descriptive stat.  Problems in machine learning 2019-09-26 28Machine learning and artificial neural network Classification Regression Clustering Few probability distribution models can be successfully applied to practical datasets. That’s why we resort to machine learning based on a collection of samples
  • 29. Data and descriptive stat.  Mean, Correlation and Covariance  Consider a dataset of N samples and M features  (Per-feature) mean: $\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ji}$  (Per-feature) variance: $\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{ji}-\mu_j)^2$  $\sigma_j$ is the standard deviation  The $\mu_j$’s and $\sigma_j^2$’s can be collectively represented as a vector 2019-09-26 29Machine learning and artificial neural network
  • 30. Data and descriptive stat.  Mean, Correlation and Covariance  Dataset of N samples and M features  Correlation (for a pair of features): $r_{jk} = \frac{1}{N}\sum_{i=1}^{N} x_{ji}x_{ki}$  Covariance (for a pair of features): $c_{jk} = r_{jk} - \mu_j\mu_k$  $r_{jk}=r_{kj}$, $c_{jk}=c_{kj}$ (symmetric)  The $r_{jk}$’s and $c_{jk}$’s can be collectively represented as matrices 2019-09-26 30Machine learning and artificial neural network
  • 31. Data and descriptive stat.  Mean, Correlation matrix and Covariance matrix  Consider a dataset of N samples and M features, collected in the M×N matrix 𝑿  Mean (vector): $\boldsymbol{\mu}_X = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}_i$  Correlation matrix: $\boldsymbol{R}_{XX} = \frac{1}{N}\boldsymbol{X}\boldsymbol{X}^T$  Covariance matrix: $\boldsymbol{C}_{XX} = \boldsymbol{R}_{XX} - \boldsymbol{\mu}_X\boldsymbol{\mu}_X^T$  Cross correlation: $\boldsymbol{R}_{Xy} = \frac{1}{N}\boldsymbol{X}\boldsymbol{y}^T$  Cross covariance: $\boldsymbol{C}_{Xy} = \boldsymbol{R}_{Xy} - \boldsymbol{\mu}_X\mu_y$ 2019-09-26 31Machine learning and artificial neural network Size: M×M (for 𝑹_XX, 𝑪_XX) Size: M×1 (for 𝑹_Xy, 𝑪_Xy)
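A small NumPy sketch of these quantities, assuming the 1/N normalization and the convention that each column of X is a sample (the textbook's convention; np.cov uses 1/(N−1) instead).

```python
import numpy as np

# X: M x N data matrix (each column is a sample, each row a feature)
X = np.array([[77., 81., 74., 89.],
              [72., 67., 74., 64.],
              [75., 68., 72., 68.]])
N = X.shape[1]

mu = X.mean(axis=1, keepdims=True)   # M x 1 mean vector
R_XX = (X @ X.T) / N                 # M x M correlation matrix
C_XX = R_XX - mu @ mu.T              # M x M covariance matrix (C = R - mu mu^T)
print(mu.ravel())
print(C_XX)
```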
  • 32. Data and descriptive stat.  Properties of 𝑪_XX (and 𝑹_XX)  $\boldsymbol{C}_{XX}^T = \boldsymbol{C}_{XX}$ (symmetric)  𝑪_XX is non-negative definite, such that, for any vector 𝒗, $\boldsymbol{v}^T\boldsymbol{C}_{XX}\boldsymbol{v} \ge 0$  The eigen values are all non-negative and their eigen vectors form an orthonormal basis, i.e., with eigen decomposition $\boldsymbol{C}_{XX} = \boldsymbol{E}\boldsymbol{\Lambda}\boldsymbol{E}^T$, the diagonal elements of 𝚲 are all non-negative real and $\boldsymbol{E}^T\boldsymbol{E}=\boldsymbol{I}$  If N < M (the number of samples is less than the number of features), then 𝑪_XX has at most N non-zero eigen values (all others are zero). In this case, 𝑪_XX is not invertible  These properties also hold for 𝑹_XX 2019-09-26 32Machine learning and artificial neural network
  • 33. Data and descriptive stat.  For the two given data matrices,  Find 𝝁_X  Find 𝑹_XX and 𝑪_XX  Check whether 𝑹_XX and 𝑪_XX satisfy the properties in the previous slide. 2019-09-26 33Machine learning and artificial neural network
  • 34. Data and descriptive stat.  Example (Problem 2.2)  Find the correlation and covariance between • English and math • English and science • Math and science  Find 𝑹_XX and 𝑪_XX  Check whether 𝑹_XX and 𝑪_XX satisfy the properties in the previous slide. 2019-09-26 34Machine learning and artificial neural network sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD. English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09 Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09 Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
  • 35. Homework & Computer Lab.  Homework: 2.1, 2.2 2019-09-26 35Machine learning and artificial neural network
  • 36. Machine Learning and Neural Network Ch.3: Multi-variate Gaussian PDF and linear transform Seokhyun Yoon, Electronics Eng., Dankook University
  • 37. Ch.3 Multivariate Gaussian PDF & linear transform  Topics 1. Multi-variate Gaussian PDF  Pearson’s correlation coefficient 2. Linear transformation  Principal axes transform and whitening 3. Principal component analysis (PCA) 2019-09-26 37Machine learning and artificial neural network
  • 38. Multivariate Gaussian PDF: definition  Definition of multivariate Gaussian (Normal) PDF  Consider a Gaussian random vector $\boldsymbol{x}=(x_1,\dots,x_M)^T$  The PDF of 𝒙 is defined, in general, as $p(\boldsymbol{x}) = (2\pi)^{-M/2}|\boldsymbol{C}|^{-1/2}\exp\!\left\{-\tfrac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{C}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right\}$, where 𝝁 is the mean and 𝑪 is the covariance matrix  Note  Quadratic form: $(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{C}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})$ is a scalar  Mahalanobis distance: $d_M(\boldsymbol{x},\boldsymbol{\mu}) = \sqrt{(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{C}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}$ (symmetric) 2019-09-26 38Machine learning and artificial neural network
  • 39. Multivariate Gaussian PDF  3 cases of bivariate Gaussian (Normal) PDF  Case 1: 𝝁 = [0, 5]ᵀ, 𝑪 = [[9, 0], [0, 9]];  Case 2: 𝝁 = [0, 5]ᵀ, 𝑪 = [[1, 0], [0, 16]];  Case 3: 𝝁 = [0, 5]ᵀ, 𝑪 = [[9, −10], [−10, 16]] 2019-09-26 39Machine learning and artificial neural network Mean is just a “translation” (Contour plot)
  • 40. Multivariate Gaussian PDF  Let’s take a closer look  “Contour” can be obtained from 𝑻  Suppose that for simplicity  𝑻  Suppose also that  𝟏 𝟏  Then, we have 𝑧 𝜎 − 2𝜌 𝑧 𝜎 𝑧 𝜎 + 𝑧 𝜎 = 𝑐′(1 − 𝜌 )  where is Pearson correlation coefficient defined as satisfying  We say and are uncorrelated if and has perfect correlation if 2019-09-26 40Machine learning and artificial neural network This is an ellipse
  • 41. Multivariate Gaussian PDF  Examples  The Pearson correlation coefficient between two random variables (two features) $x_1$ and $x_2$ is defined as $\rho = c_{12}/(\sigma_1\sigma_2)$, satisfying $-1 \le \rho \le 1$  We say that $x_1$ and $x_2$ are uncorrelated if $\rho = 0$ and have perfect correlation if $|\rho| = 1$ 2019-09-26 41Machine learning and artificial neural network
  • 42. Data and descriptive stat.  What can you see? 2019-09-26 42Machine learning and artificial neural network  Are Math and English scores correlated ?  What can you say about Math and English score? Set up your hypothesis.  Use the figure in the previous page to roughly estimate the Pearson correlation coefficient.
  • 43. Multivariate Gaussian PDF (note)  Marginalization of an M-variate Gaussian PDF is also a Gaussian PDF with (M−1) variates  Successive marginalization gives us a univariate Gaussian PDF 2019-09-26 43Machine learning and artificial neural network
  • 44. Linear transform  Definition of a linear transformation  For any matrix of size (KxM), linear transform of a vector of size (Mx1) is defined as  Linear transform is a projection of onto the row space of  Linear transform of a Gaussian random vector  Suppose that be a Gaussian RV with mean and cov. , i.e.,  Then, for any matrix , the linear transform is also Gaussian with mean and covariance , i.e.,  Try to verify using the def. of mean and covariance in Ch.2 2019-09-26 44Machine learning and artificial neural network
  • 45. Linear transform  Principal axes transformation and Whitening  Suppose that (eigen-decomposition of ) , : diagonal matrix with ( th eigen value) : eigen basis ( th column is the eigen vector for )  (Principal axes transform) The linear transform by using as transform matrix, is Gaussian with PDF  (Whitening) By using / as transform matrix, / is also Gaussian with PDF / 2019-09-26 45Machine learning and artificial neural network
  • 46. Principal Component Analysis (PCA)  Principal component analysis (PCA)  With  PCA uses several (typically two) eigen vectors corresponding to the largest eigen values as projection matrix.  Let • ( , ) be the two largest eigen values • ( , ) be the corresponding eigen vectors  We use as transform matrix  The distribution of can be easily visualized in a low dimensional (e.g., 2D) space.  If 𝑪 , contains most of the information on , i.e., 2019-09-26 46Machine learning and artificial neural network
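A minimal NumPy sketch of PCA as described above: eigen-decomposition of the covariance matrix followed by projection onto the two leading eigenvectors. Function and variable names are illustrative, not from the textbook.

```python
import numpy as np

def pca_2d(X):
    """Project M x N data (columns = samples) onto its two largest principal axes."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                  # center the data
    C = (Xc @ Xc.T) / X.shape[1]                 # M x M covariance matrix
    eigval, eigvec = np.linalg.eigh(C)           # ascending eigenvalues, orthonormal eigenvectors
    W = eigvec[:, ::-1][:, :2]                   # two eigenvectors with the largest eigenvalues (M x 2)
    return W.T @ Xc                              # 2 x N projected data

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 100))
Z = pca_2d(X)
print(Z.shape)   # (2, 100)
```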
  • 47. Data (distribution) visualization  Pairwise scatter plot is NOT affordable for large M 2019-09-26 47Machine learning and artificial neural network M = 4 M = 64 (showing only 10 features)
  • 48. Data (distribution) visualization 2019-09-26 48Machine learning and artificial neural network Pair-wise scatter plots of the Iris dataset (3 classes, 4-dimensional features). A 2-dimensional projection provides a better representation of the clusters and the similarity between features
  • 49. Data (distribution) visualization 2019-09-26 49Machine learning and artificial neural network Pair-wise scatter plots of the Digits dataset (10 classes, 64-dimensional features), showing only the first 10x10. A 2-dimensional projection provides a better representation of the clusters and the similarity between features
  • 50. Homework & Computer Lab.  Homework: 3.1~3.6  Practice: ML_practice0_ch3_data_visualization_190817c.ipynb 2019-09-26 50Machine learning and artificial neural network
  • 51. Machine Learning and Neural Network Appendix A: Optimization I Seokhyun Yoon, Electronics Eng., Dankook University
  • 52. Appendix: Optimization  Topics 1. Optimization I: Unconstrained optimization  Definition of optimization problem  Quadratic programming problem  Maximum likelihood estimation as an optimization problem 2. Optimization II: Iterative solutions  Gradient descent and stochastic gradient descent  Coordinate descent  Newton-Raphson method 3. Optimization III: Constrained optimization  Definition  Lagrange multiplier and Rayleigh quotient optimization  Duality in constrained optimization and KKT condition 2019-09-26 52Machine learning and artificial neural network
  • 53. Unconstrained optimization  Definitions of unconstrained optimization  Minimization: $\min_{\boldsymbol{\theta}\in\mathbb{R}^M} J(\boldsymbol{\theta})$ or $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$  Maximization: $\max_{\boldsymbol{\theta}\in\mathbb{R}^M} J(\boldsymbol{\theta})$ or $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$, where $J(\boldsymbol{\theta})$ is a cost/objective function.  Convex optimization  If $J(\boldsymbol{\theta})$ is a convex function, the solution can be obtained by solving $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \boldsymbol{0}$ (as there is only one minimum (maximum)), where $\nabla_{\boldsymbol{\theta}}$ is the gradient operator 2019-09-26 53Machine learning and artificial neural network
  • 54. Unconstrained optimization: QP problem  Quadratic programming (QP) problem  QP problem is a special case of convex optimization problem  is a quadratic function of , i.e., 𝜽  Since is a convex function, the solution is given by solving  Solution: ∗ 𝜽 (if is invertible) 2019-09-26 54Machine learning and artificial neural network
  • 55. Unconstrained optimization: Gradient formula  Gradient operators  For vector : 𝜽 𝜽  For matrix : 𝑨 𝑨  Gradient formula  𝜽 𝜽  𝜽  𝑨  𝑨 𝟏 2019-09-26 55Machine learning and artificial neural network
  • 56. Unconstrained optimization: Gradient formula  Example (Problem A.1):  Minimize the given function, i.e., find the minimizing point and also the minimum value  Express the function in vector-matrix form  Use the vector-matrix form to minimize it (use the gradient formula)  Repeat for the other given function 2019-09-26 56Machine learning and artificial neural network
  • 57. Maximum likelihood estimation  Given  Data samples:  PDF model: with unknown parameter  We want to find that maximize  likelihood of :  Or log-likelihood:  It is a maximization problem ∗ 𝜽∈ℝ 𝜽∈ℝ 2019-09-26 57Machine learning and artificial neural network
  • 58. MLE example: Bernoulli trial  Given  Data samples: $x_1,\dots,x_N$, where $x_i\in\{0,1\}$  PDF model: $p(x)=q^x(1-q)^{1-x}$ with parameter $q$  Parameter to estimate: $q$  Likelihood function: $L(q)=\prod_{i=1}^{N} q^{x_i}(1-q)^{1-x_i}$  Solution: $q^* = k/N$, where k is the number of 1’s that occurred in N trials 2019-09-26 58Machine learning and artificial neural network Try to verify this by maximizing the likelihood or log-likelihood function.
  • 59. MLE example: Multi-variate Gaussian PDF (optional)  Given  Data samples: $\boldsymbol{x}_1,\dots,\boldsymbol{x}_N$  PDF model: multivariate Gaussian with 𝝁: mean, 𝑪: covariance matrix  Parameters to estimate: 𝝁 and 𝑪  Log-likelihood function: $\ell(\boldsymbol{\mu},\boldsymbol{C})=\sum_i \log p(\boldsymbol{x}_i;\boldsymbol{\mu},\boldsymbol{C})$  Solution:  $\boldsymbol{\mu}^*=\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}_i$  $\boldsymbol{C}^*=\frac{1}{N}\sum_{i=1}^{N}(\boldsymbol{x}_i-\boldsymbol{\mu}^*)(\boldsymbol{x}_i-\boldsymbol{\mu}^*)^T$ 2019-09-26 59Machine learning and artificial neural network Try to verify this using the gradient formula.
  • 60. Seokhyun Yoon, Electronics Eng., Dankook University Machine Learning and Neural Network Ch.4: Regression
  • 61. Roadmap 2019-09-26 61Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 62. Ch.4 Regression  Topics 1. Linear regression 2. Vector-matrix representation of linear regression 3. Linear prediction 4. Non-linear regression and overfit 5. Performance evaluation: cross-validation 2019-09-26 62Machine learning and artificial neural network
  • 63. Regression  Elements of regression problem  Prediction model ( ) with parameters ( )  Data (observations and their target values )  Cost(loss)/Objective function to minimize/maximize  Algorithm to efficiently obtain the optimal or a good solution 2019-09-26 63Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: Cost/loss: Algorithm to solve ∗ 𝜽
  • 64. Regression: Linear regression  A simple example of linear regression  Data: where  Model: where parameter  Problem is to find the best for given  Best in what sense ? 2019-09-26 64Machine learning and artificial neural network x y (xi, yi)
  • 65. Regression: Linear regression  Least squares solution (최소제곱법)  We want to minimize the residual sum of squares (RSS)  Define error:  Minimize: ;𝜽  where, is a quadratic (convex) function of and  Can use 𝜽 to find and in terms of 2019-09-26 65Machine learning and artificial neural network
  • 66. Regression: Linear regression  Generalization to multi-variate data  Data: where ,  Model: where parameter  Cost function: Residual sum of squares (RSS)  where  ;𝜽  Problem is to find ∗ 𝜽∈ℝ 2019-09-26 66Machine learning and artificial neural network
  • 67. Regression: Model structure  Model and its training at a glance 2019-09-26 67Machine learning and artificial neural network
  • 68. Regression: Linear regression  Solution  is a quadratic function of ’s (convex function)  Can use 𝜽 to obtain a system of equations  Then, solve the system of equations to get ∗ : :  Equivalently, in vector-matrix form, 𝑿 𝑿 𝑿 𝒚 where 𝑻 , 𝑿 𝑿 , 𝑿 𝒚 2019-09-26 68Machine learning and artificial neural network
  • 69. Regression: Vector matrix notation  Vector-matrix notation  Data: where ,  Model: where ,  Cost function: Residual sum of squares (RSS)  Error vector:   𝑿 𝑿 𝑿 𝒚 2019-09-26 69Machine learning and artificial neural network where 𝑿 = 𝟏 𝑻 𝑿 = 1 1 𝑥 𝑥 … 1 ⋯ 𝑥 ⋮ ⋮ 𝑥 𝑥 ⋱ ⋮ ⋯ 𝑥
  • 70. Regression: Vector matrix notation  Vector-matrix notation  Problem is to find the solution of 𝜽 , which is 𝜽 𝑿 𝑿 𝑿 𝒚 𝑿 𝑿 𝑿 𝒚  Solution: ∗ 𝑿 𝑿 𝑿 𝒚  Unique solution exists only if 𝑿 𝑿 is invertible! 2019-09-26 70Machine learning and artificial neural network
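A minimal NumPy illustration of the least-squares solution above. For brevity it uses the common rows-as-samples layout, so the normal equation reads (XᵀX)θ = Xᵀy; the data values are made up for the example.

```python
import numpy as np

# toy 1-D example: y ≈ theta0 + theta1 * x
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

X = np.column_stack([np.ones_like(x), x])   # N x 2 design matrix with intercept column
theta = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) theta = X^T y
print(theta)                                # roughly [1.1, 1.96]
y_hat = X @ theta                           # predictions on the training inputs
```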
  • 71. Regression: Linear regression example  Example  We want to estimate the English score using two models: (English score) = θ₀ + θ₁·(Math score), and (English score) = θ₀ + θ₁·(Math score) + θ₂·(Science score)  Find (θ₀, θ₁) and (θ₀, θ₁, θ₂), respectively (you may use the results of Problem 2.2)  Homework: finish problems 4.1 and 4.2 2019-09-26 71Machine learning and artificial neural network sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD. English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09 Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09 Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
  • 72. Regression: Linear prediction  Linear prediction  Given time series data  Use p previous samples to predict the next sample, i.e., we want to predict using ( )  Model: ( )  Example 4.3 2019-09-26 72Machine learning and artificial neural network 𝑡 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 𝑥 -4 -3 14 8 1 -5 -7 -4 -2 6 10 22 15 -15 -20 ?
  • 73. Regression: Linear prediction  Linear prediction  Target value:  Data matrix:  Model: ( ) (no intercept) :  Solution: ∗ 𝜽 𝑿𝑿 𝟏 𝑿𝒚  Prediction: ∗ ∗ ( )  Note: 𝑿𝑿 is a Toeplitz matrix 2019-09-26 73Machine learning and artificial neural network
  • 74. Regression: Linear prediction  Homework: Example 4.3 1) For the given prediction order p, express 𝒚 and 𝑿 and find 𝑿𝑿ᵀ and 𝑿𝒚. 2) Find the linear predictor parameters 𝜽* and predict the next sample. 3) Compute the mean squared error (N=14). 4) Repeat (1)–(3) for the other given order. 5) Compute the variance … of the time-series data (where …), and find … for … . 6) Briefly compare and discuss the results of (5). 2019-09-26 74Machine learning and artificial neural network 𝑡 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 𝑥 -4 -3 14 8 1 -5 -7 -4 -2 6 10 22 15 -15 -20 ?
  • 75. Regression: Non-linear model and overfit  Example of Non-linear regression  Two-feature data  Non-linear model: where  Defining , RSS cost gives us ∗ 𝑿 𝑿 𝑿 𝒚  Note  The model is non-linear in ’s, but linear in ’s  RSS cost function gives us a linear system of equations 2019-09-26 75Machine learning and artificial neural network
  • 76. Regression: Non-linear model and overfit  Considerations for non-linear regression  If the model is a non-linear function of the parameters, the problem (finding the solution) becomes complicated.  A non-linear model is subject to overfit (large generalization error), especially when the number of samples is relatively small compared to the number of parameters in the model.  We need to check whether the model is overfitted to the data or not. 2019-09-26 76Machine learning and artificial neural network source: https://slideplayer.com/slide/6825533/
  • 77. Regression: Non-linear model and overfit  Overfit, underfit and just(appropriate) fit 2019-09-26 77Machine learning and artificial neural network source: https://slideplayer.com/slide/6825533/ source : https://towardsdatascience.com/underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6fe4a8a49dbf
  • 78. Regression: Non-linear model and overfit  How to check whether the model is overfitted or not  If the model is overfitted, the generalization error is much (?) larger than the minimized cost on the training data, where 𝜽* was obtained based on the training data only  That’s why we divide the data (samples) into training and test sets for performance evaluation  A more systematic approach to test overfit: cross validation 2019-09-26 78Machine learning and artificial neural network
  • 79. Regression: Non-linear model and overfit  L-fold cross-validation (교차 검증) 1. We divide the entire data (of N samples) into L groups (of N/L samples per group) 2. Select one group for test and use all others for training 3. Measure ∗ and ∗ 4. Repeat 2 and 3 for each group and take average on both measures 5. Check if ∗ ∗ 2019-09-26 79Machine learning and artificial neural network
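A generic Python sketch of the L-fold procedure above; fit and mse are placeholders for any model-fitting and error functions (here the least-squares fit from Ch.4 is plugged in as an example, and rows of X are samples).

```python
import numpy as np

def l_fold_cv(X, y, fit, mse, L=5):
    """Average training and test MSE over L folds (X: N x M, rows = samples)."""
    N = X.shape[0]
    idx = np.random.permutation(N)
    folds = np.array_split(idx, L)
    train_err, test_err = [], []
    for k in range(L):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(L) if j != k])
        theta = fit(X[train], y[train])
        train_err.append(mse(X[train], y[train], theta))
        test_err.append(mse(X[test], y[test], theta))
    return np.mean(train_err), np.mean(test_err)

# example: plain least squares and its RSS-per-sample error
fit = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)
mse = lambda X, y, th: np.mean((y - X @ th) ** 2)
# avg_train, avg_test = l_fold_cv(X, y, fit, mse, L=5)
```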
  • 80. Homework & Computer Lab.  Homework: 4.1, 4.2, 4.3  Computer Lab: ML_practice1_regression_ex_190820.ipynb 2019-09-26 80Machine learning and artificial neural network
  • 81. Seokhyun Yoon, Electronics Eng., Dankook University Machine Learning and Neural Network Ch.5: Regularization
  • 82. Roadmap 2019-09-26 82Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 83. Ch.5 Regularization  Topics 1. Ridge regression 2. LASSO regression 3. Elastic-net 2019-09-26 83Machine learning and artificial neural network
  • 84. Regularization  Recall linear regression  Data: where ,  Model: where ,  Cost function: Residual sum of squares (RSS)  where  𝑿 𝑿 𝑿 𝒚  Problem is to find the solution of 𝜽 , which is 𝜽 𝑿 𝑿 𝑿 𝒚  ∗ 𝑿 𝑿 𝑿 𝒚  Unique solution exists if 𝑿 𝑿 is invertible! What if it is NOT? 2019-09-26 84Machine learning and artificial neural network
  • 85. Regularization  In what case is 𝑿ᵀ𝑿 NOT invertible?  It is not if N < M, i.e., when the number of samples is less than the number of features (e.g., as in bioinformatics and medical applications)  An infinite number of solutions exist  The model parameters and performance can be highly variable with small changes in the data (overfit)  Two possible approaches  Increasing the sample size (noise injection)  Reducing the feature dimension (selecting good features) 2019-09-26 85Machine learning and artificial neural network
  • 86. Regularization  Increasing sample size (noise injection)  One can double the number of samples by generating new set of data where is random noise matrix with covariance , i.e., ,  Then, use as new data  Note that 𝑿 𝑿 𝑿 𝑿 , which is now invertible “anyway” if 2N > M  It is effectively a “noise injection”  generalization error can be reduced to some extent  If needed, one can add more with different random noise.  The noise variance must be chosen carefully.  Note: the distribution of may not model well the true distribution of . 2019-09-26 86Machine learning and artificial neural network
  • 87. Regularization  Reducing feature dimension (selecting features)  One can select M’ (<N) features, for example, having highest covariance with target value y.  However, this does not guarantee a better performance.  An efficient feature selection method (LASSO) will be discussed shortly 2019-09-26 87Machine learning and artificial neural network
  • 88. Regularization: Ridge and LASSO  Ridge and LASSO regression: RSS + L2/L1 Penalty  Ridge: $J(\boldsymbol{\theta}) = \mathrm{RSS}(\boldsymbol{\theta}) + \lambda\|\boldsymbol{\theta}\|_2^2$  LASSO: $J(\boldsymbol{\theta}) = \mathrm{RSS}(\boldsymbol{\theta}) + \lambda\|\boldsymbol{\theta}\|_1$  Lp-norm: $\|\boldsymbol{\theta}\|_p = \left(\sum_k |\theta_k|^p\right)^{1/p}$  λ controls the relative weight between RSS and penalty  Elastic net: RSS + L1 + L2 Penalty  $J(\boldsymbol{\theta}) = \mathrm{RSS}(\boldsymbol{\theta}) + \lambda_1\|\boldsymbol{\theta}\|_1 + \lambda_2\|\boldsymbol{\theta}\|_2^2$ 2019-09-26 88Machine learning and artificial neural network
  • 89. Regularization: What is the impact of penalty?  Ridge regression  Ridge regression is simply a QP problem  And the solution is $\boldsymbol{\theta}^* = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}^T\boldsymbol{y}$  $\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I}$ is invertible with $\lambda > 0$, even if $\boldsymbol{X}^T\boldsymbol{X}$ is not (see Problem 6.3)  It is effectively a “noise injection” (an increase of the sample size)  and the generalization error can be reduced to some extent 2019-09-26 89Machine learning and artificial neural network
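A one-function NumPy sketch of the ridge closed-form solution above; for simplicity the intercept is penalized together with the other parameters, which practical implementations usually avoid, and rows of X are samples.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution theta* = (X^T X + lam*I)^(-1) X^T y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
```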
  • 90. Regularization: What is the impact of penalty?  LASSO regression  LASSO stands for Least Absolute Shrinkage and Selection Operator  It tends to select the features that describe the target value y well  some θ's vanish if the corresponding features don’t have strong correlation with y  LASSO effectively reduces M, rather than increasing N. 2019-09-26 90Machine learning and artificial neural network
  • 91. Regularization: What is the impact of penalty?  Further remarks on LASSO regression  λ controls sparsity (a high λ selects fewer features)  LASSO tends to select one feature from a group of highly correlated variables (features) and ignore the rest.  Unlike the L2-penalty, the L1-penalty is not differentiable at 0  LASSO regression is a convex optimization problem, but NOT a simple QP problem  use an iterative algorithm to find the solution, especially when M>N (the coordinate descent algorithm to be discussed next)  See the textbook for the coordinate descent algorithm for LASSO 2019-09-26 91Machine learning and artificial neural network
  • 92. Regularization: Elastic-net  Elastic-net  Elastic-net combines the L1 and L2 penalties  The L1-penalty selects features (generating a sparse model)  The L2-penalty reduces the generalization error and also encourages grouping effects. 2019-09-26 92Machine learning and artificial neural network Homework & Computer Lab.  Homework: 6.2, 6.3  Computer lab: ML_practice1_regression_ex_190820.ipynb
  • 93. Machine Learning and Neural Network Appendix C: Optimization III Seokhyun Yoon, Electronics Eng., Dankook University
  • 94. Appendix: Optimization  Topics 1. Optimization I: Unconstrained optimization  Definition of optimization problem  Quadratic programming problem  Maximum likelihood estimation as an optimization problem 2. Optimization II: Iterative solutions  Gradient descent and stochastic gradient descent  Coordinate descent  Newton-Raphson method 3. Optimization III: Constrained optimization  Definition  Lagrange multiplier and Rayleigh quotient optimization  Duality in constrained optimization and KKT condition 2019-09-26 94Machine learning and artificial neural network
  • 95. Unconstrained optimization  Definitions of unconstrained optimization  Minimization: $\min_{\boldsymbol{\theta}\in\mathbb{R}^M} J(\boldsymbol{\theta})$ or $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$  Maximization: $\max_{\boldsymbol{\theta}\in\mathbb{R}^M} J(\boldsymbol{\theta})$ or $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$, where $J(\boldsymbol{\theta})$ is a cost/objective function.  Convex optimization  If $J(\boldsymbol{\theta})$ is a convex function, the solution can be obtained by solving $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \boldsymbol{0}$ (as there is only one minimum (maximum))  Sometimes, however, one cannot get a closed-form solution.  What can we do, then? 2019-09-26 95Machine learning and artificial neural network
  • 96. Iterative search for minimum/maximum  One idea: gradient search  Gradient descent  Hill climbing  Steps  Given cost function J()  Initialize n = 0, (n) = 0  Loop (epoch): 1. Compute gradient at the current position, 𝜽 𝜽 𝜽( ) 2. Update param., ( ) ( ) 3. n  n+1 4. Repeat 1~3 until convergence 2019-09-26 96Machine learning and artificial neural network  𝜂: Learning rate, 0 < 𝜂 ≪ 1  Small enough 𝜂 ensures that  Large 𝜂: Fast convergence, but high MSE due to bouncing  Small 𝜂: Slow convergence, while lower MSE 𝐽 𝜽 ≥ 𝐽 𝜽
  • 97. Iterative search for minimum/maximum  Stochastic gradient descent (SGD)  The cost is typically a sum of per-sample costs  Update for every sample  Steps  Initialize 𝜽 = 0  Outer loop (epoch): for n = 1,2,… • Inner loop: for i = 1,2,…,N (number of samples), update 𝜽 using the gradient of the per-sample cost • Repeat the inner loop until convergence 2019-09-26 97Machine learning and artificial neural network
  • 98. Iterative search for minimum/maximum  In linear regression   𝜽 (gradient of per-sample cost)  SGD for linear regression  Initialize  = 0, n=0  Outer Loop (epoch): for n = 1,2,… • Inner loop: for i = 1,2,…,N (number of samples) 𝑒 ( ) = 𝑦 − 𝒙 𝜽( ) ( ) ( ) ( ) • Repeat inner loop until convergence 2019-09-26 98Machine learning and artificial neural network
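A NumPy sketch of the per-sample SGD update above for linear regression; the hyper-parameter values are arbitrary and the function name is illustrative.

```python
import numpy as np

def sgd_linreg(X, y, eta=0.01, epochs=100):
    """Per-sample SGD for linear regression (X: N x M, rows = samples)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                         # outer loop over epochs
        for i in np.random.permutation(len(y)):     # inner loop over samples
            e = y[i] - X[i] @ theta                 # per-sample prediction error
            theta += eta * e * X[i]                 # gradient step on the per-sample cost
    return theta
```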
  • 99. Iterative search for minimum/maximum  Using momentum  In SGD, if each sample contains “noise”, it disturbs the algorithm, i.e., parameter may move to incorrect direction  It can be alleviated using momentum ( ) ( ) 𝜽 ( ) ( ) ( ) where 2019-09-26 99Machine learning and artificial neural network
  • 100. Iterative search for minimum/maximum  Coordinate descent  Rather than to update every parameters at a time  Update parameters one by one (one coordinate at a time) 𝜃 = argmin 𝐽 𝜃 , 𝜃 , … , 𝜃 , 𝜃 , 𝜃 , … , 𝜃  is given by the solution of equation 𝐽(𝜽) 𝜽𝒌 [ , ,…, , ,…, ] = 0  Simpler implementation 𝜃 = argmin 𝐽 𝜃 , 𝜃 , … , 𝜃 , 𝜃 , 𝜃 , … , 𝜃 2019-09-26 100Machine learning and artificial neural network
  • 101. Iterative search for minimum/maximum  Coordinate descent for linear regression  Cost: 𝟐    With , and , , we have , , ,  Update rule: ( ) , , ( ) ,  Homework: C.1 2019-09-26 101Machine learning and artificial neural network
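A NumPy sketch of coordinate descent for the plain RSS cost: each coordinate is set to the exact minimizer of the cost with all other coordinates fixed. (The LASSO version in the textbook adds a soft-thresholding step; this sketch omits it.)

```python
import numpy as np

def coordinate_descent_linreg(X, y, epochs=50):
    """Update one coordinate of theta at a time by minimizing the RSS in that coordinate."""
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(epochs):
        for k in range(M):
            # residual excluding feature k's current contribution
            r = y - X @ theta + X[:, k] * theta[k]
            theta[k] = (X[:, k] @ r) / (X[:, k] @ X[:, k])
    return theta
```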
  • 102. Machine Learning and Neural Network Ch.6: Classification Seokhyun Yoon, Electronics Eng., Dankook University
  • 103. Ch.6 Classification: problem formulation  Topics 1. Bayesian approach 2. Bayesian approach under Gaussian assumption  Decision boundary 3. Linear model as a special case 2019-09-26 103Machine learning and artificial neural network
  • 104. Classification: Problem formulation  Data  Data: where ,  where is a set of categories(classes)  ’s are categorical and discrete  Bayesian approach: probabilistic model  Assume each class (kth class) is distributed ~ p(x|Hk).  Given new data x, decide its class y as ∈  i.e., select class index for which the conditional probability of x is maximum 2019-09-26 104Machine learning and artificial neural network
  • 105. Classification: Bayesian approach  Binary classification  Assume binary classification (for simplicity), i.e.,  Given new data x, decide its class y by comparing log- likelihood  Binary classification under Gaussian assumption  Assume with parameter and .  Then, we have 𝑻 𝑻 2019-09-26 105Machine learning and artificial neural network
  • 106. Classification: Bayesian approach  Binary classification under Gaussian assumption  Suppose that . Then, we have 𝑻 𝑻  i.e., compare (Mahalanobis) distances of x from class centers 2019-09-26 106Machine learning and artificial neural network 𝑝 𝒙|𝐻 𝑝 𝒙|𝐻
  • 107. Classification: Decision boundary  Decision boundary  It is a “surface” where , i.e., 𝑻 𝑻  It can be written as 𝑻 𝑻  where (a vector) (a scalar)  The decision boundary is given by “conic section”  which can be an hyperbola, an ellipse or a (hyper) plane 2019-09-26 107Machine learning and artificial neural network
  • 108. Classification: Linear model  Linear model for binary classification  Suppose further that .  Then, the decision boundary becomes 𝑻 𝑻 𝑻  which is a (hyper) plane  And the decision rule becomes 𝑻 or equivalently, 𝑻  Model parameter: and (intercept)  Linear classifier partitions ( ? ) into non-overlapping areas using ( ? ) 2019-09-26 108Machine learning and artificial neural network
  • 109. Classification: Linear model vs. Bayesian approach  Bayesian classifier versus linear classifier 2019-09-26 109Machine learning and artificial neural network 𝑝 𝒙|𝐻 𝑝 𝒙|𝐻 𝜽 𝑻 𝒙 + 𝜃 = 0
  • 110. Classification: Summary  Binary classification: summary  Bayesian approach  Under Gaussian assumption (with ) 𝑻 𝑻  With , we get linear model 𝑻 or equivalently, 𝑻 2019-09-26 110Machine learning and artificial neural network Our main focus is on this linear model
  • 111. Classification: Naive implementation  Naive implementation  Given data: where ,  is a set of categories (classes)  ’s are categorical and discrete, e.g.,  Divide data into and (for each class)  Compute for  Use ’s for classification  This is not our focus, though. 2019-09-26 111Machine learning and artificial neural network
  • 112. Classification: Roadmap  Based on the model,  Ch.7: We will develop training (learning) rule, where we obtain and directly from data by solving an optimization problem  Ch.8: The linear model will be extended for multinomial classification problem  Ch.9: The model will be further extended to get neural network model : 2019-09-26 112Machine learning and artificial neural network
  • 113. Homework & Computer Lab.  Homework: 5.1, 5.2, 5.3 2019-09-26 113Machine learning and artificial neural network
  • 114. Machine Learning and Neural Network Ch.7: Logistic Regression (binary classification) Seokhyun Yoon, Electronics Eng., Dankook University
  • 115. Roadmap 2019-09-26 115Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 116. Ch.7 Logistic Regression for binary classification  Topics 1. Logistic regression:  Model with logistic sigmoid function 2. Parameter optimization:  Likelihood function as an objective function  Application of gradient search algorithm 3. Performance measures of binary classifier  Confusion matrix, True Positive and False negative  Accuracy, Sensitivity, Specificity  ROC and AUC 2019-09-26 116Machine learning and artificial neural network
  • 117. Logistic regression: Model  Recall (generalized) linear model for binary classification,  It is a linear regressor if  It is a linear classifier if  It is a logistic regressor if ( ) 2019-09-26 117Machine learning and artificial neural network
  • 118. Logistic regression: Model  Interpretation of the logistic regression model $\hat{y} = \sigma(\boldsymbol{\theta}^T\boldsymbol{x} + \theta_0)$, where $\sigma(z) = 1/(1+e^{-z})$  $\hat{y}$ can be regarded as $\Pr\{y=1|\boldsymbol{x}\}$, so that $\Pr\{y=1|\boldsymbol{x}\} = e^{\boldsymbol{\theta}^T\boldsymbol{x}+\theta_0}/(1+e^{\boldsymbol{\theta}^T\boldsymbol{x}+\theta_0})$ and $\Pr\{y=0|\boldsymbol{x}\} = 1/(1+e^{\boldsymbol{\theta}^T\boldsymbol{x}+\theta_0})$  $\hat{y}$ can also be interpreted as a “class estimate”.  In both cases, if $\hat{y} > 0.5$, 𝒙 is likely to be class 1; otherwise class 0.  $\boldsymbol{\theta}^T\boldsymbol{x} + \theta_0$ is called the “odds” of being class 1 (more precisely, the log-odds). 2019-09-26 118Machine learning and artificial neural network
  • 119. Logistic regression  Geometrical interpretation 2019-09-26 119Machine learning and artificial neural network Decision boundary: 𝜽ᵀ𝒙 + θ₀ = 0; Decision variable 𝑧 = 𝜽ᵀ𝒙 + θ₀ (odds of 𝒙 belonging to class 1); Class 1 / Class 0
  • 120. Logistic regression: Cost function  Cost function: Negative log-likelihood  𝑻 can be interpreted as probability (likelihood) that belongs to class 1.  Likelihood that belongs to the target class is given by  Log-likelihood as an “objective” to maximize  Can also be formulated as minimization of 2019-09-26 120Machine learning and artificial neural network
  • 121. Logistic regression  Elements of regression/classification problem  Data (observations and their target values )  Prediction model ( ) with parameters ( )  Cost(loss)/Objective function to minimize/maximize  Algorithm to efficiently obtain the optimal or a good solution 2019-09-26 121Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: 𝑻 ( 𝜽 𝑻 𝒙 ) Cost/loss: Algorithm to min/maximize: Gradient descent
  • 122. Logistic regression: Optimization  Optimization  The cost contains the non-linear function σ(·), so maximizing it isn’t a simple QP problem.  We resort to gradient search to get the optimal (or a good) solution.  To perform gradient search, we need the gradient of the cost, which is given by (see textbook p.68) $\nabla_{\boldsymbol{\theta}}\,\ell = \sum_i \big(y_i - \sigma(\boldsymbol{\theta}^T\boldsymbol{x}_i)\big)\,\boldsymbol{x}_i$  Algorithm (pseudo code)  Initialize 𝜽⁽⁰⁾  𝜽⁽ⁿ⁺¹⁾ = 𝜽⁽ⁿ⁾ + η ∇𝜽ℓ(𝜽⁽ⁿ⁾) for n = 0, 1, 2, … 2019-09-26 122Machine learning and artificial neural network “+” means hill-climbing
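A NumPy sketch of the hill-climbing update above, written as a full-batch gradient step (an assumption for brevity; a per-sample variant would follow the SGD pattern of Appendix C). X is assumed to have rows as samples with a leading 1 for the intercept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, eta=0.1, epochs=200):
    """Maximize the log-likelihood of a logistic model by gradient ascent (hill climbing)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ theta)           # predicted probability of class 1 for each sample
        grad = X.T @ (y - p)             # gradient of the log-likelihood
        theta += eta * grad / len(y)     # "+" = ascent, as noted on the slide
    return theta
```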
  • 123. Logistic regression: Another cost function  Another cost: Residual sum of square (RSS)  𝑻 can also be interpreted as class estimate.  Define estimation error:  RSS as a cost to minimize  Gradient (see textbook p.68) 𝜽  Gradient descent ( ) ( ) ( ) for .  What’s difference from likelihood based optimization? 2019-09-26 123Machine learning and artificial neural network “-” means gradient descent
  • 124. Performance measures of binary classifier  Confusion matrix     2019-09-26 124Machine learning and artificial neural network  Why do we need other measures than accuracy?  In some application, FN (FP) causes more serious problem than FP (FN)  E.g., in medical application, you want to make decision if a person has tumor (P) or not (N). It isn’t a big problem if a normal person (without tumor) is decided to have tumor (FP). But, the opposite case (a person with tumor decided as normal, FN) may cause serious problem.  You may want to minimize FPR requiring TPR no less than a certain threshold.
  • 125. Performance measures of binary classifier  ROC and AUC  ROC: Receiver operating characteristic  AUC: Area under (the ROC) curve 2019-09-26 125Machine learning and artificial neural network [ROC figure: FPR = FP/(FP+TN) on the x-axis, TPR = TP/(TP+FN) on the y-axis; AUC is the area under the curve; the curve shows how performance changes as the decision boundary moves — shifting it one way, TP and FP go down while TN and FN go up, and vice versa]
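For reference, scikit-learn computes the ROC curve and AUC directly from the target labels and the decision variable; the scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])  # decision variable

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # sweep the decision boundary
print(roc_auc_score(y_true, y_score))               # area under the ROC curve
```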
  • 126. Homework & Computer Lab.  Homework: 7.1, 7.2  Practice: ML_practice2_classification_ex_190820.ipynb 2019-09-26 126Machine learning and artificial neural network
  • 127. Machine Learning and Neural Network Ch.8: Multi-task regression and multinomial classification Seokhyun Yoon, Electronics Eng., Dankook University
  • 128. Roadmap 2019-09-26 128Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 129. Ch.8 Multiclass classification  Topics 1. Multi-task regression 2. multinomial classification 3. Generalized linear model 2019-09-26 129Machine learning and artificial neural network
  • 130. Multi-task linear regression  Linear regression with vector target  Data: where , where : KxN matrix with each column being  Linear model: 𝑻 where : Kx(M+1) matrix (including intercept)  Define  ( ): th row of . ( : th column of )  : th column of  Cost function (RSS) 𝑻 ( ) ( ) 2019-09-26 130Machine learning and artificial neural network
  • 131. Multi-task linear regression  Linear regression with vector target  Cost function is a sum of RSS for each target value ( ) ( )  Optimization can be performed separately for each target value, i.e., 𝚯 𝜽 ( )  where 𝜽 ( ) gives 𝑿 𝑿 𝑿 𝒚( )  And 𝚯 gives 𝑿 𝑿 𝑿 𝒀  Can be implemented using K parallel linear regressors with scalar target value 2019-09-26 131Machine learning and artificial neural network
  • 132. Multi-task linear regression  Linear regression with vector target  Can be implemented using K parallel linear regressors with scalar target value  Alternative expression of cost function 𝑻 𝑻 2019-09-26 132Machine learning and artificial neural network
  • 133. Multinomial classification: two approaches  Multinomial classification can be implemented using multiple binary classifiers.  Two approaches (K-class case)  One against the rest:  we use K binary classifiers, one for each class.  Each classifier (the kth classifier) computes, for example, the likelihood of input x belonging to the kth class.  Decide the class having the highest likelihood  Pairwise binary classification + majority voting:  we use K(K−1)/2 binary classifiers, one for each pair of classes.  Decide the class by taking the majority of the winners. 2019-09-26 133Machine learning and artificial neural network
  • 134. Multinomial logistic regression  Data  Data: where ,  where is a set of categories (classes)  ’s are categorical and discrete  Considerations  (single-task) logistic regressor using (integer) as target value will not work well (because ’s are categorical, while single-task regressor regards ’s as numerical.)  One approach is to encode ’s to a binary vector (of size Kx1) and use multi-task logistic regressor 2019-09-26 134Machine learning and artificial neural network
  • 135. Multinomial logistic regression  Model  Softmax function on top of multi-task linear regressor  Multi-task linear regressor for (odds of belonging to class ) Or, collectively,  softmax function ∑ (likelihood of belonging to class )  Note that and 2019-09-26 135Machine learning and artificial neural network
  • 136. Multinomial logistic regression  Cost/objective  can be interpreted as Pr{ belongs to class }  Log-likelihood can be used as the objective to maximize.  Gradient: 𝜽 where  Gradient search: ( ) ( ) 𝜣 𝜣 𝜣( ) for . 2019-09-26 136Machine learning and artificial neural network Since 0 ≤ 𝑆 (𝜣 𝒙 ) ≤ 1, the direction of gradient is either 𝒙 for 𝑘 = 𝑦 or −𝒙 for 𝑘 ≠ 𝑦
  • 137. Multinomial logistic regression: more issues  One hot encoding  One hot encoding is a mapping of an integer to a binary vector .., such that , i.e., only one element of is 1 and all others are 0.  Example: ,  By encoding all the target values , , .., , , we have  is a KxN matrix with each column being  Then, the gradient is given by 𝜽 , 2019-09-26 137Machine learning and artificial neural network
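A NumPy sketch of one-hot encoding and of one gradient-ascent step of multinomial logistic regression, following the convention above that samples are columns; function and variable names are illustrative, not from the textbook.

```python
import numpy as np

def one_hot(y, K):
    """Map integer labels 0..K-1 to a K x N matrix of one-hot columns."""
    Y = np.zeros((K, len(y)))
    Y[y, np.arange(len(y))] = 1.0
    return Y

def softmax(O):
    E = np.exp(O - O.max(axis=0, keepdims=True))   # subtract the max for numerical stability
    return E / E.sum(axis=0, keepdims=True)

def softmax_step(Theta, X, Y, eta=0.1):
    """One gradient-ascent step on the log-likelihood (X: (M+1) x N, samples as columns)."""
    P = softmax(Theta @ X)                         # K x N class probabilities
    Theta += eta * (Y - P) @ X.T / X.shape[1]      # gradient of the log-likelihood
    return Theta
```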
  • 138. Multinomial logistic regression : more issues  Cross-entropy  With one hot encoding: , , .., ,  is the probability mass of  Posterior likelihood of : , , , with ,  The cross entropy between and is given by ,  We call “cross-entropy cost”. 2019-09-26 138Machine learning and artificial neural network
  • 139. Multinomial logistic regression: more issues  Multi-task logistic regressor  Using one-hot encoding, one can replace (for simplicity) the softmax function with K separate logistic sigmoid functions  K parallel logistic regressors.  Performance? 2019-09-26 139Machine learning and artificial neural network [Diagram: inputs x0..xM feeding K parallel units o1..oK with sigmoid outputs p̂1..p̂K]  Other remarks  Multinomial logistic regression is a one-against-the-rest approach.  Once the likelihoods p̂k are obtained, the class estimate is determined by taking the largest one
  • 140. Multinomial logistic regression: generalization  Generalized linear model  Linear regression and logistic regression can be represented by one structure  Consisting of an “activation function” on top of a multi-task linear regressor  The output can be interpreted in various ways (e.g., as likelihoods or as estimates of the target value) 2019-09-26 140Machine learning and artificial neural network  Also, there are many options for the activation function (e.g., linear, sigmoid or tanh)  If the input is categorical, apply one-hot encoding before feeding it to the regressor (the input dimension must be changed accordingly)
  • 141. Multinomial logistic regression: generalization  Generalized linear model  Regularization can also be applied, if desired, by defining the cost with a penalty term on the weight matrix, for both linear regression and logistic regression  The model basically regards the input and output as numerical. So, if you deal with categorical values, you need to apply one-hot encoding first. 2019-09-26 141Machine learning and artificial neural network
  • 142. Homework & Computer Lab.  Practice: ML_practice2_classification_ex_190820.ipynb 2019-09-26 142Machine learning and artificial neural network
  • 143. Machine Learning and Neural Network Ch.9: Artificial neural network Seokhyun Yoon, Electronics Eng., Dankook University
  • 144. Ch.9 Artificial neural network  Topics 1. Perceptron and artificial neural network (NN) 2. Neural network model 3. Training NN: backpropagation 4. Some issues on NN  Convergence to local minima  Overfitting  Vanishing gradient problem 5. Practical considerations (building and training NN) 2019-09-26 144Machine learning and artificial neural network
  • 145. Roadmap 2019-09-26 145Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 146. ANN: Perceptron  Perceptron  It is an array of neurons interconnected, exactly the same as in generalized linear models  It was suggested mimicking biological neuron 2019-09-26 146Machine learning and artificial neural network Biological neuron source: https://en.wikipedia.org/wiki/Biological_neuron_model Regression model (Artificial neuron) f(net) Neuron Input nodes (dendrites) Output nodes (axon terminal) Activation function (synaptic) Weights x0 x1 x2 xM 0 1 2 M net T x y
  • 147. ANN: Perceptron  Perceptron  The multi-task regression model is a horizontal array of artificial neurons, with either combined activation or separate activation 2019-09-26 147Machine learning and artificial neural network [Diagrams: inputs x0..xM feeding units o1..oK, either with a combined activation f(o1, o2,…, oK) producing ŷ1..ŷK, or with separate activations s(o1)..s(oK) producing p̂1..p̂K]
  • 148. ANN: Multi-layer Perceptron  Multi-layer Perceptron  Consists of multiple layers of multi-task regressors, vertically stacked  The output of one layer is fed to the input of the next layer.  The number of layers and the number of neurons per layer can be arbitrarily set  Non-linear activation functions make it different from the single-layer (linear) model, i.e., they make the model non-linear  Can be used for regression and classification 2019-09-26 148Machine learning and artificial neural network
  • 149. ANN: Multi-layer Perceptron  Operations  Feedforward (prediction phase): For a given input and the current parameters, it produces an output  Feedback (training phase): For each input and target vector, the parameters are updated  Gradient search is used for some optimality criterion 2019-09-26 149Machine learning and artificial neural network
  • 150. ANN: Multi-layer Perceptron  Structure definition  Number of layers:  Number of neurons per layer:  Full connection assumed  Signals and parameters  Input:  Target vector:  Weight matrix: ( )  Hidden layer output: ( )  Final output: ( ) 2019-09-26 150Machine learning and artificial neural network
  • 151. ANN: Multi-layer Perceptron  Feedforward (prediction) From to 1) ( ) ( ) ( ) ( ( ) ) 2) ( ) ( ) ( ( ) )  More simply, ( ) ( ) ( )  ( ) is 1-augmented version of ( )  ( ) is matrix including “intercept”  activation function is applied to each element of ( ) 2019-09-26 151Machine learning and artificial neural network
  • 152. ANN: Multi-layer Perceptron  Feedback (training)  Assume training is performed in per-sample basis, i.e., SGD  Cost function (RSS): ( , ,…, ) ( ) ( ) ( )  Cross-entropy can also be used as cost (not covered here)  To train the model, we need 𝑾( ) for  Top layer is easy: 𝑾( ) ( ) ( ) ( ) ( ) , where ( ) ( ) ( ) and ( ) ( ) ( )  Layer below ? We need to apply chain rule  The problem, however, is not as simple as you expect. See textbook, section 9.3 2019-09-26 152Machine learning and artificial neural network
  • 153. ANN: Multi-layer Perceptron  Feedback (training)  The training starts from top layer and run through downward, one by one.  Training: From to : ( ) ( ) ( ) with ( ) 𝑾( ) ( , ,…, ) where, by applying chain rule (see textbook p.81-82) ( ) ( ) ( ) ( ) ( ) ( ) ( )  We call it “backpropagation (BP)” as it is performed backward (downward), opposite to feedforward operation. 2019-09-26 153Machine learning and artificial neural network
  • 154. ANN: Multi-layer Perceptron  Back-propagation (BP) algorithm  From to : ( ) ( ) ( ) where ∆𝑤 ( ) = δ ( ) z ( ) δ ( ) = 𝑓′ 𝑎 ( ) 𝑦 − z ( ) δ ( ) = 𝑓′ 𝑎 ∑ 𝑤 δ 2019-09-26 154Machine learning and artificial neural network Vector-matrix form ∆𝑾( ) = 𝛅 𝐳 𝛅 = 𝑓′ 𝒂( ) ⨀ 𝒚 − 𝒛( ) 𝛅 = 𝑓′ 𝒂( ) ⨀ 𝑾( ) 𝛅( ) ⨀: element-wise product
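A per-sample NumPy sketch of the BP step above for a 2-layer MLP with sigmoid activations and RSS cost; shapes are W1: H×(M+1), W2: K×(H+1). This is a sketch under those assumptions, not the textbook's code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, y, W1, W2, eta=0.1):
    """One per-sample backprop step for a 2-layer MLP (sigmoid activations, RSS cost)."""
    # feedforward
    z0 = np.append(1.0, x)                    # 1-augmented input
    a1 = W1 @ z0
    z1 = np.append(1.0, sigmoid(a1))          # 1-augmented hidden output
    a2 = W2 @ z1
    z2 = sigmoid(a2)                          # network output
    # backpropagation: delta of the top layer, then the layer below
    d2 = sigmoid(a2) * (1 - sigmoid(a2)) * (y - z2)
    d1 = sigmoid(a1) * (1 - sigmoid(a1)) * (W2[:, 1:].T @ d2)   # drop the bias column
    # weight updates Delta W = delta z^T, applied with learning rate eta
    W2 += eta * np.outer(d2, z1)
    W1 += eta * np.outer(d1, z0)
    return W1, W2
```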
  • 155. ANN: Multi-layer Perceptron  Activation function  Except for the top (output) layer, activation function should be non-linear for a hidden layer to be effective.  Any monotonically increasing function can be used.  They are typically s-shaped, e.g., logistic sigmoid or tanh  ReLU or leaky ReLU are widely used recently. ReLU: Leaky ReLU: with 2019-09-26 155Machine learning and artificial neural network
  • 156. ANN Issues: Convergence to local minima  Convergence to local minima  An NN is a non-linear model and the cost J is not convex.  The number of minima/maxima is not known  Gradient search does not guarantee convergence to the global minimum  The local minimum we get depends on the initial setting of W  There is no systematic approach to achieve the global minimum yet.  Simulated annealing and genetic algorithms were proposed as heuristic solutions 2019-09-26 156Machine learning and artificial neural network
  • 157. ANN Issues: Overfitting  Overfitting  An NN model has many parameters (W(1),W(2),…,W(L))  Deep NNs especially so  Similar to the linear model with N << M, an NN with too many parameters may easily be overfitted to the training data  Three approaches to relieve overfitting  Noise injection: increasing the number of data by adding noise  reduces the generalization error (to some extent)  Regularization technique: add an L1/L2 penalty to the cost function  similar impact to noise injection  Dropout ? 2019-09-26 157Machine learning and artificial neural network
  • 158. ANN Issues: Overfitting  Dropout: avoiding co-adaptation of neurons  Useful for Convolutional NN (for image)  At each training phase (for a batch of samples), we randomly select a portion of neurons (with probability p) and disable them  Can avoid many neurons co-adapted to each other (avoid many neurons activated to similar data)  Many NN packages support dropout layer as an option 2019-09-26 158Machine learning and artificial neural network
  • 159. ANN Issues: Vanishing gradient  Vanishing gradient problem  This is also a typical problem in deep neural networks.  BP (training) starts from the top layer and runs downward one-by-one, recursively.  Recall that the deltas δ(l) contain the factor f′(a); with a sigmoid function f′(a) ≤ 1/4 (it’s mostly close to 0)  The δ’s are computed recursively  As BP runs downward, δ gets smaller and smaller, and so does ΔW(l)  vanishing gradient  If the NN has many layers, the effective learning rate in the bottom layers gets very small, i.e., neurons in the bottom layers are hardly trained  it takes too much time to train them 2019-09-26 159Machine learning and artificial neural network
  • 160. ANN Issues: Vanishing gradient  Vanishing gradient problem  Using ReLU or leaky ReLU may help alleviate vanishing gradient problem.  Unsupervised learning based pre-training of bottom layers was proposed, though not so widely used recently. 2019-09-26 160Machine learning and artificial neural network
  • 161. ANN Issues: Building NN model  To build a neural network model, you need to consider first  Input and output dimension?  How many layers? ( )  How many neurons for each layer? ( )  Activation function ? (sigmoid, tanh, ReLU or leaky ReLU)  Dropout layer? With what probability? (p)  What cost function ? (RSS or cross-entropy)  Which optimizer to use? (simple SGD w/wo momentum .. )  Batch size?  Regression or classification ? (For regression, top layer activation is typically set linear) 2019-09-26 161Machine learning and artificial neural network
  • 162. ANN Issues: Training NN model  When training NN, you need to check  Overfitting (compare performance with training and test data while training the model)  Vanishing gradient (check if training takes too much time)  Convergence to bad local minima (you can train many times or train multiple instances in parallel with different initial values) 2019-09-26 162Machine learning and artificial neural network Computer Lab.  Practice: ML_practice3_NN_ex.ipynb
  • 163. Machine Learning and Neural Network Ch.10: Recurrent neural network (RNN) Seokhyun Yoon, Electronics Eng., Dankook University
  • 164. Ch.10 Recurrent neural network  Topics 1. Model structure and operation. 2. RNN Training: backpropagation through time (BPTT) 3. LSTM (long/short term memory) 2019-09-26 164Machine learning and artificial neural network
  • 165. Roadmap 2019-09-26 165Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 166. RNN: Recurrent neural network  Features  Recurrence means the output is fed back to the input  Necessarily, the input is time-series data  The example on the right consists of two layers  The hidden layer output is fed back to the input with a one-sample delay (D) 2019-09-26 166Machine learning and artificial neural network  Layer 2 has no feedback loop (conventional NN layer)  Main applications are speech recognition and language modelling (machine translation, sentence completion), where data is given as a time series  𝒉(t) = 𝑓(𝑼𝒙(t) + 𝑽𝒉(t−1)),  𝒚(t) = 𝑓(𝑾𝒉(t))
  • 167. RNN: Recurrent neural network  Model  Consider a 1-layer RNN for simplicity  Input: 𝒙(t) (time series)  Output (state): 𝒉(t) (time series)  Feedforward operation: 𝒉(t) = 𝑓(𝑼𝒙(t) + 𝑽𝒉(t−1))  The output depends on both 𝒙(t) and the previous output (state) 𝒉(t−1)  The feedforward operation can also be expressed as 𝒉(t) = 𝑓(𝒈(t)) with 𝒈(t) = 𝑼𝒙(t) + 𝑽𝒉(t−1)  Initial condition: assume 𝒉(0) = 𝟎 2019-09-26 167Machine learning and artificial neural network [Figure (a): RNN with a loop — f(·), h(t), x(t), h(t−1), U, V, D, g(t)]
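A minimal NumPy sketch of the feedforward recursion h(t) = f(U x(t) + V h(t−1)), using tanh as the activation (an assumption; other activation functions can be used as well).

```python
import numpy as np

def rnn_forward(X, U, V, h0=None):
    """Run a single-layer vanilla RNN over a sequence: h(t) = tanh(U x(t) + V h(t-1))."""
    H = V.shape[0]
    h = np.zeros(H) if h0 is None else h0
    outputs = []
    for x_t in X:                      # X: sequence of input vectors x(1), x(2), ...
        h = np.tanh(U @ x_t + V @ h)
        outputs.append(h)
    return np.array(outputs)           # stacked states h(1), ..., h(T)
```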
  • 168. RNN: Recurrent neural network  Unfolded model 2019-09-26 168Machine learning and artificial neural network f(·) h(t) x(t) h(t-1) U V (a) RNN with a loop D g(t)
  • 169. RNN: Training  RNN Training (textbook 10.2)  Cost function: ( ) ( ) ( ) ( ( ): target vector)  Gradient can be obtained by applying chain rule.  Gradient w.r.t. (at time ) 𝑽 ( ) 𝑽 where ( ) 𝑽 ( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝑽 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) ( ) 𝒈( ) 𝒈( ) 𝑽  With 𝒈( ) 𝒈( ) ( ) and ( ) 𝒈( ) 𝒈( ) 𝑽 ( ) ( ) , ( ) 𝑽 ( ) ( ) ( ) 2019-09-26 169Machine learning and artificial neural network
  • 170. RNN: Training  RNN Training (textbook 10.2)  Gradient of (at time ) 𝑼 ( ) 𝑼 where ( ) 𝑼 ( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝑼  In the same way as for the gradient w.r.t. , we have ( ) 𝑼 ( ) ( ) ( )  To update and , we need perform BP through time (from to ).  We call it backpropagation through time (BPTT). 2019-09-26 170Machine learning and artificial neural network
  • 171. RNN: Training  Vanishing and exploding gradient  Looking at 𝑼 (and also 𝑽 ), the gradient contains ( )  where for any activation function we considered, ( ) (matrix norm)  Assuming , we have ( ) (mostly , why?)  As , l.h.s. goes to 0 if (vanishing gradient) or to  if ( ) /( ) (exploding gradient)  The latter seldom occurs. 2019-09-26 171Machine learning and artificial neural network
  • 172. RNN: Training  Forgets past inputs/outputs quickly  We also have for 𝑼 ( ) ( ) ( ) ( ) ( )  RNN is supposed to memorize the past inputs (in the system state) to deal with time-series data.  With , however, as gets large.  This means the system forgets past inputs quickly.  There are many examples where we need long term memory to correctly catch what exactly the sentence means. 2019-09-26 172Machine learning and artificial neural network
  • 173. RNN: Training  RNN summary  Due to its recurrent nature, RNN training requires backpropagation through time (back to t=1)  If T gets large, the gradient may vanish or explode.  the training rule should be carefully tuned.  In most cases, vanishing gradient occurs more frequently than exploding gradient  One solution to avoid the vanishing/exploding gradient problem is to perform BPTT only over a finite-length time window (an unfolded model of finite length) 2019-09-26 173Machine learning and artificial neural network
  • 174. RNN: LSTM  Long/Short term memory (LSTM)  a variant of RNN (proposed in 1997) to solve (partly) the vanishing gradient and to make system memory longer.  Vanilla RNN and LSTM  3 gates (forgetting/input/output gate) + main path  Two separate states: ( ) and (𝒕) 2019-09-26 174Machine learning and artificial neural network
  • 175. RNN: LSTM  LSTM operation  Gating function: ( ) ( ) ( ) (forget gate) ( ) ( ) ( ) (input gate) ( ) ( ) ( ) (output gate)  Cell state update: (𝒕) ( ) ( ) ( ) ( ) ( ) (long-term memory) ( ) ( ) ( ) (short-term memory) 2019-09-26 175Machine learning and artificial neural network
  • 176. RNN: LSTM  LSTM operation  Cell state update: (𝒕) ( ) ( ) ( ) ( ) ( ) (long-term memory) ( ) ( ) ( ) (short-term memory, final output)  When ignoring the gating function, ( ) is simply a sum of ( ) and the new input ( ) ( )  can keep long-term memory  ( ) select important features from previous state ( ), which comprise a part of the current output ( ) .  ( ) select important features from the new input (output of vanilla RNN), which comprise another part of the current output)  ( ) controls what features in ( ) to pass to output ( ) . 2019-09-26 176Machine learning and artificial neural network
  • 177. RNN: LSTM  LSTM operation  Gating function: ( ) ( ) ( ) (forget gate) ( ) ( ) ( ) (input gate) ( ) ( ) ( ) (output gate)  Parameters of three gates ( , , , , , ) are obtained through BPTT, too.  i.e., LSTM learns from the data what features to select from ( ) (long-term memory) and from ( ) ( ) .  Also it learns what features in ( ) to pass to the final output ( ) . 2019-09-26 177Machine learning and artificial neural network
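A NumPy sketch of one LSTM cell update following the standard formulation described above; biases are omitted and the parameter names Wf, Uf, … are illustrative, so the textbook's notation may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc):
    """One LSTM cell update: gates, long-term cell state, short-term output."""
    f = sigmoid(Wf @ x + Uf @ h_prev)     # forget gate
    i = sigmoid(Wi @ x + Ui @ h_prev)     # input gate
    o = sigmoid(Wo @ x + Uo @ h_prev)     # output gate
    g = np.tanh(Wc @ x + Uc @ h_prev)     # candidate features (vanilla-RNN part)
    c = f * c_prev + i * g                # long-term memory (cell state)
    h = o * np.tanh(c)                    # short-term memory (final output)
    return h, c
```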
  • 178. RNN: Building RNN/LSTM model  Unfolded RNN/LSTM model  You can add NN layer on top of RNN/LSTM cell. 2019-09-26 178Machine learning and artificial neural network RNN/LSTM cell K RNN/LSTM cell K-1 RNN/LSTM cell 2 RNN/LSTM cell 1 y(t) y(t-1) y(t-K+1) y(t-K) x(t) x(t-1) x(t-K+1) x(t-K) DD D Computer Lab.  Practice 1: ML_practice4_RNN_seq_pred.ipynb  Practice 2: ML_practice5_RNN_hihello.ipynb
  • 179. Machine Learning and Neural Network Ch.11: Convolutional neural network (CNN) Seokhyun Yoon, Electronics Eng., Dankook University
  • 180. Ch.11 Convolutional neural network  Topics 1. Features of CNN 2. CNN Model  Convolution sublayer  Activation function sublayer  Pooling sublayer 3. CNN Training 2019-09-26 180Machine learning and artificial neural network
  • 181. Roadmap 2019-09-26 181Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 182. CNN: Convolutional neural network  Image/Vision classification and object detection  An image has 2D(matrix) or 3D(tensor) structure (i.e., RGB)  Information is contained in a pixel, an element of a matrix (2D image) or a tensor (2D images for RGB or 2D images captured with 2 cameras).  Nearby pixels (values) are highly correlated  patterns in an image can be identified by the correlations between nearby pixels  nearby pixels must be processed as a chunk  Identifying patterns in an image is “translation invariant” and “size invariant”. (we can identify same patterns wherever it is located and whatever its size is).  Sometimes, it should also be rotation invariant. 2019-09-26 182Machine learning and artificial neural network
  • 183. CNN: Convolutional neural network  CNN for image/vision data  A CNN is a special NN designed for image/vision data.  Can be used for image classification, object detection, depth estimation, etc.  It processes a chunk of nearby pixels simultaneously (the receptive field).  We will see how it provides object (pattern) detection with translation invariance.  Size invariance can be provided by a multi-layer structure.  Rotation invariance? 2019-09-26 183Machine learning and artificial neural network
  • 184. CNN Model  (Example) configuration of a CNN  Two convolution NN layers and 3 fully connected (FC) NN layers.  Convolution NN layers are divided into sublayers: a convolution sublayer (denoted by CX) and a pooling sublayer (denoted by SX).  The (FC) NN layers are C5, F6 and the output layer (C5 behaves like an FC NN layer). 2019-09-26 184Machine learning and artificial neural network source: Proc. of IEEE, Nov. 1998 by Y. LeCun, et al.
  • 185. CNN Model – Convolution layer  CNN model (convolution layer)  For convenience, we divide it into 3 sublayers:  Convolution sublayer  Activation function sublayer  Pooling sublayer  The activation function sublayer is the same as in a conventional NN.  Dropout can also be applied, as in a fully connected NN layer. 2019-09-26 185Machine learning and artificial neural network
  • 186. CNN Model – Convolution layer  CNN model (convolution layer)  A conventional NN layer has a 1-dimensional array of neurons, while a CNN layer has a 3-dimensional array (width, height and depth), where the depth index is called the “channel”.  The input to a CNN layer is also 3-dimensional, e.g., 2D images with RGB (3 channels).  Denote the 3-D input and output of the l-th CNN layer as x_i^(l) and y_j^(l), where i and j are channel indices. 2019-09-26 186Machine learning and artificial neural network
  • 187. CNN Model – Convolution layer  CNN model (convolution layer)  The operations of the three sublayers are  Convolution sublayer ------------: u_j^(l) = Σ_i W_{j,i}^(l) ∗ x_i^(l)  Activation function sublayer ---: v_j^(l) = f(u_j^(l))  Pooling sublayer -----------------: y_j^(l) = pool(v_j^(l))  The input and output sizes are the same only for the AF sublayer; the other two sublayers have different input and output sizes. 2019-09-26 187Machine learning and artificial neural network
  • 188. CNN Model – Convolution layer  Convolution sublayer  u_j^(l) = Σ_i W_{j,i}^(l) ∗ x_i^(l)  W_{j,i}^(l) is the weight matrix (filter) between the i-th channel of the input and the j-th channel of the output.  ∗ is 2-D convolution, with which the (m,n)-th element of u_j^(l) is given by u_{m,n}^(l,j) = Σ_i Σ_{(p,q)∈R(m,n)} w_{p,q}^(l,j,i) x_{p,q}^(l,i)  R(m,n) is the “receptive field” of the (m,n)-th neuron 2019-09-26 188Machine learning and artificial neural network [Figure: 2-D array of neurons of the j-th output channel and 2-D array of input signal of the i-th input channel]
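The per-element sum above can be written directly in NumPy. The following is only a minimal sketch (stride 1, no zero-padding, no bias term), with assumed array shapes, not an efficient or complete implementation.

```python
import numpy as np

def conv_sublayer(x, W):
    """x: (C_in, H, W_in) input channels; W: (C_out, C_in, K, K) filters."""
    C_in, H, Wd = x.shape
    C_out, _, K, _ = W.shape
    Ho, Wo = H - K + 1, Wd - K + 1
    u = np.zeros((C_out, Ho, Wo))
    for j in range(C_out):
        for m in range(Ho):
            for n in range(Wo):
                # receptive field R(m, n): a K x K patch of every input channel
                u[j, m, n] = np.sum(W[j] * x[:, m:m + K, n:n + K])
    return u

# Toy usage: 2 input channels, 3 output channels, 5x5 filters
rng = np.random.default_rng(0)
u = conv_sublayer(rng.standard_normal((2, 12, 12)), rng.standard_normal((3, 2, 5, 5)))
print(u.shape)  # (3, 8, 8)
```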
  • 189. CNN Model – Convolution layer  Convolution sublayer  Each filter responds to a certain pattern within a receptive field on the input.  Filter examples: three filters of size 5x5 respond to different patterns (diamond, T and diagonal, respectively).  The filter coefficients are obtained through CNN training and, in general, they are real values. 2019-09-26 189Machine learning and artificial neural network
  • 190. CNN Model – Convolution layer  Convolution sublayer example (2 input ch., 3 output ch.)  All the neurons of a channel share the same weight matrices.  A channel (2D array) is a feature map containing information on a (combination of) specific pattern(s) defined by its weight matrices (information on location and existence). 2019-09-26 190Machine learning and artificial neural network
  • 191. CNN Model – Convolution layer  Convolution sublayer  Configuration parameters • s: stride, K×K: size of the (2-D) weight matrix • size of the input and output (3-D): width × height × number of channels  The stride, filter size and input/output sizes must be set consistently.  The number of weight matrices (filters) to train is (number of input channels) × (number of output channels).  In general, the output width/height are no larger than those of the input, while the number of output channels may differ from the number of input channels. 2019-09-26 191Machine learning and artificial neural network [Figure: 2-D array of neurons of the j-th output channel and 2-D array of input signal of the i-th input channel]
  • 192. CNN Model – Convolution layer  Activation function sublayer  v_j^(l) = f(u_j^(l))  The output of the convolution sublayer, u_j^(l), is then passed through an activation function.  ReLU or leaky ReLU is typically used.  The output v_j^(l) has the same size as the input. 2019-09-26 192Machine learning and artificial neural network
  • 193. CNN Model – Convolution layer  Pooling sublayer  y_j^(l) = pool(v_j^(l))  The pooling sublayer down-samples the sublayer input, v_j^(l).  While doing so, it also summarizes the data.  Let r be the down-sampling ratio. Each channel of the input is partitioned into r×r areas (pooling areas), in each of which an r×r array of numbers is summarized into a scalar.  Two types: max-pooling (takes the maximum value) and average-pooling (takes the average of the values).  The output size is 1/r² of the input size. 2019-09-26 193Machine learning and artificial neural network
  • 194. CNN Model – Convolution layer  Pooling sublayer  The pooling operation can be expressed as y_{m,n}^(l,j) = max_{(p,q)∈P(m,n)} v_{p,q}^(l,j) (max-pooling) or y_{m,n}^(l,j) = (1/|P(m,n)|) Σ_{(p,q)∈P(m,n)} v_{p,q}^(l,j) (average-pooling)  P(m,n) is the pooling area of the (m,n)-th output.  Pooling reduces the computational burden, e.g., with r = 2, the amount of data passed to the next layer (and hence the number of parameters to train there) is reduced to ¼.  If r is too large, however, important information can be lost.  It is better to apply pooling multiple times with a small r. 2019-09-26 194Machine learning and artificial neural network
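A minimal NumPy sketch of the pooling operation above, assuming the channel height and width are multiples of the down-sampling ratio r:

```python
import numpy as np

def pool_sublayer(v, r=2, mode="max"):
    """v: (C, H, W) activations; returns (C, H//r, W//r)."""
    C, H, W = v.shape
    blocks = v.reshape(C, H // r, r, W // r, r)  # split each channel into r x r pooling areas
    if mode == "max":
        return blocks.max(axis=(2, 4))           # max-pooling
    return blocks.mean(axis=(2, 4))              # average-pooling

# Toy usage
y = pool_sublayer(np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8), r=2)
print(y.shape)  # (3, 4, 4)
```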
  • 195. CNN training  CNN training  The parameters to optimize are the weight matrices, W_{j,i}^(l)'s, for all layers and channel pairs.  Similar to a conventional NN, we apply the chain rule to compute the gradient w.r.t. W_{j,i}^(l).  Differences from a conventional NN: 1. 3-D (cubic) arrays of neurons 2. partial connection & weight sharing in the conv. sublayer 3. passing the gradient through the pooling sublayer.  See textbook section 11.3 for details 2019-09-26 195Machine learning and artificial neural network
  • 196. CNN training  Improving the performance of a CNN  Apply dropout to avoid co-adaptation between channels.  Data normalization: adjust the mean (brightness) and variance (contrast) of each image to make them fall within predefined ranges.  Batch normalization: normalize the data for each batch at each layer.  Data augmentation: enlarge the data set by resizing and/or rotating the original images  size/rotation invariance. 2019-09-26 196Machine learning and artificial neural network
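Putting the pieces together, the following hedged tf.keras sketch shows where convolution, pooling, batch normalization and dropout typically sit in such a model; the input shape (28×28 grayscale) and all layer sizes are illustrative assumptions, not the configuration used in the course practice notebook.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),            # e.g., grayscale images
    tf.keras.layers.Conv2D(16, 5, activation="relu"),    # convolution + activation sublayers
    tf.keras.layers.BatchNormalization(),                 # batch normalization
    tf.keras.layers.MaxPooling2D(pool_size=2),            # pooling sublayer, r = 2
    tf.keras.layers.Conv2D(32, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),        # FC NN layer on top
    tf.keras.layers.Dropout(0.5),                         # dropout against co-adaptation
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```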
  • 197.  Practice: ML_practice6_CNN_190820.ipynb 2019-09-26 197Machine learning and artificial neural network Computer Lab.
  • 198. Machine Learning and Neural Network Ch.12/13: Unsupervised learning: Clustering and data visualization Seokhyun Yoon, Electronics Eng., Dankook University
  • 199. Ch.12/13 Clustering and data visualization  Topics 1. Clustering  Partitioning (centroid) based clustering: k-means algorithm  Hierarchical (connectivity based) clustering and dendrogram  Density based clustering  Distribution based clustering 2. EM algorithm for Gaussian Mixture Model (Ch.13) 3. Data visualization using non-linear mapping: t-SNE 2019-09-26 199Machine learning and artificial neural network
  • 200. Clustering and data visualization  Clustering  Data without labels: {x_i}_{i=1}^N, where x_i ∈ R^d  The objective is to divide the data into a set of groups based on some similarity measure.  Need to devise procedures to efficiently group the data.  Data (distribution) visualization to check the clusters.  Typical similarity measures:  Euclidean distance: ‖x_i − x_j‖  Correlation: x_i^T x_j / (‖x_i‖ ‖x_j‖) 2019-09-26 200Machine learning and artificial neural network
  • 201. Clustering and data visualization  Four approaches to clustering  Partitioning (centroid) based clustering: k-means  Hierarchical (connectivity based) clustering  Density based clustering  Distribution based clustering: Gaussian Mixture Model and EM algorithm (ch.13) 2019-09-26 201Machine learning and artificial neural network
  • 202. Partitioning (centroid) based Clustering: k-means  Partitioning (centroid) based clustering 2019-09-26 202Machine learning and artificial neural network  The feature space is partitioned into Voronoi regions, where each region is represented by a centroid.  Based on the Euclidean distance measure, the points in a Voronoi region are those closest to its centroid.  The k-means (Lloyd) algorithm searches for the centroids of a pre-defined number of regions to partition.
  • 203. Partitioning (centroid) based Clustering: k-means  K-means clustering (Lloyd's algorithm)  Input: {x_i}_{i=1}^N, K: the number of clusters to find  Initialization: randomly select K samples to use them as the centroids m_1, …, m_K 1) Determine class members S_k:  Set S_k = ∅ for all k  For all samples x_i, do  k* = argmin_{k∈{1,…,K}} ‖x_i − m_k‖², S_{k*} ← S_{k*} ∪ {x_i} 2) Update centroids: m_k = (1/|S_k|) Σ_{x∈S_k} x (mean of its members)  Repeat 1) and 2) many times until the m_k don't change any more  Output: m_k for all k and a cluster label for every x_i 2019-09-26 203Machine learning and artificial neural network
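A minimal NumPy sketch of Lloyd's algorithm as written above; the initialization and stopping rule are simplified for brevity.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: (N, d) data matrix; K: number of clusters. Returns centroids and labels."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), K, replace=False)]            # random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # N x K distances
        labels = d.argmin(axis=1)                           # 1) assign members
        new_m = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else m[k]
                          for k in range(K)])               # 2) update centroids
        if np.allclose(new_m, m):                           # stop when centroids settle
            break
        m = new_m
    return m, labels

# Toy usage
X = np.random.default_rng(1).standard_normal((200, 2))
centroids, labels = kmeans(X, K=3)
```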
  • 204. Partitioning (centroid) based Clustering: k-means  Partitioning (centroid) based clustering  The k-means algorithm was originally proposed for vector quantization.  The clusters found can be quite different from our expectation, especially when the sizes of the true clusters are quite different. 2019-09-26 204Machine learning and artificial neural network Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
  • 205. Hierarchical (connectivity based) clustering  Hierarchical clustering  The cluster hierarchy is represented by a dendrogram (a binary tree representing similarity between clusters).  In the tree, a node is a cluster and a leaf node is a sample.  Two approaches to build the dendrogram: top-down (divisive) or bottom-up (agglomerative) 2019-09-26 205Machine learning and artificial neural network [Figure: dendrogram with root node, internal nodes and leaf nodes; samples labelled by (BRCA) tumor category, features labelled by gene name]
  • 206. Hierarchical (connectivity based) clustering  Bottom-up (agglomerative) approach  Initially, each sample is set as a cluster (leaf node) having only one member. 1) Compute “inter-cluster distances” for every pair of clusters (nodes without a parent). 2) Select the pair with the smallest distance and merge them into one (add a node in the tree connecting the two nodes).  Repeat 1) and 2) until only one cluster is left 2019-09-26 206Machine learning and artificial neural network Source: https://www.researchgate.net/publication/273456906_Cluster_Analysis_to_Understand_Socio-Ecological_Systems_A_Guideline/figures?lo=1
  • 207. Hierarchical (connectivity based) clustering  Bottom-up (agglomerative) approach 2019-09-26 207Machine learning and artificial neural network  Inter-cluster distance: the distance between two clusters  It can be defined as the  minimum (single-linkage)  average (average-linkage)  maximum (complete-linkage)  of the distances between every pair of members (one from each cluster)
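In practice, agglomerative clustering is usually called from a library. A hedged usage sketch with SciPy, where the method argument selects single/average/complete linkage as defined above (the toy data and the cut into 3 clusters are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).standard_normal((30, 5))   # toy data, 30 samples
Z = linkage(X, method="average", metric="euclidean")    # build the merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")         # cut the tree into 3 clusters

dendrogram(Z)       # visualize the hierarchy
plt.show()
```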
  • 208. Hierarchical (connectivity based) clustering  Top-down (divisive) approach  Initially, we have only one cluster having all the samples as its members (root node). 1) Select the cluster having the highest “intra-cluster distance” (for example). 2) Apply k-means clustering to divide it into two.  Repeat 1) and 2) until every cluster has only one member.  Another name for this is “hierarchical k-means” 2019-09-26 208Machine learning and artificial neural network
  • 209. Density based clustering  Density based clustering  A cluster is defined as a set of samples that lie within a relatively dense area.  Clusters are divided by sparse areas.  Useful when clusters are not centralized (not radially distributed).  Two well-known algorithms: DBSCAN and OPTICS 2019-09-26 209Machine learning and artificial neural network Source: https://untitledtblog.tistory.com/146
  • 210. Density based clustering  Density based clustering: DBSCAN  Two parameters: ε (distance threshold) and minPts (# of points)  Definition (core point): a point from which there are at least minPts points within a distance ε.  First, divide all the points into core and non-core points.  Assign cluster # to core points 1) Select a core point x whose cluster is not assigned yet. 2) Find all the core points that can be connected within a distance ε to each other  assign a cluster # to these core point(s) 3) Repeat 1) and 2) to find all the core point clusters  Assign cluster # to non-core points 1) For each non-core point, find the closest core point within the distance ε and set its cluster to the cluster # of that core point. 2) If there is no core point within ε, it is simply regarded as an outlier. 2019-09-26 210Machine learning and artificial neural network
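A hedged usage sketch of DBSCAN with scikit-learn; eps and min_samples play the roles of the distance threshold and the minimum number of points above, and the values shown are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).standard_normal((200, 2))      # toy data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)      # label -1 marks outliers
print(sorted(set(labels)))
```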
  • 211. Distribution based clustering  Distribution based clustering: Mixture model  Use a PDF model (with parameters) to approximate the probability distribution of each cluster.  The data distribution is modelled by a mixture of the PDFs.  A well-known, mathematically tractable one is the Gaussian mixture model (GMM), for which the data distribution is modelled by p(x) = Σ_{k=1}^{K} π_k N(x; μ_k, C_k), where k is the cluster index and K is the number of clusters.  The objective is to find the optimal model parameters π_k, μ_k, C_k for k = 1, …, K that best fit the given data set. 2019-09-26 211Machine learning and artificial neural network
  • 212. Gaussian mixture model & EM algorithm  Gaussian mixture model  p(x | z = k) = N(x; μ_k, C_k) and p(z = k) = π_k, so that p(x) = Σ_{k=1}^{K} π_k N(x; μ_k, C_k), where k is the cluster index, K is the number of clusters and z is a latent variable (은닉 변수).  The objective is to find the optimal model parameters π_k, μ_k, C_k for k = 1, …, K that best fit the given data set.  Issues  We may use the likelihood as the objective function: L(θ) = Π_i p(x_i; θ) (θ consists of the {π_k, μ_k, C_k}'s)  It is not easy to maximize, as p(x) contains a summation over k due to the latent variable. 2019-09-26 212Machine learning and artificial neural network
  • 213. Gaussian mixture model & EM algorithm  EM algorithm (in general)  Use the conditional likelihood given z, i.e., assume z (the cluster of each sample x_i) is fixed.  Define the conditional (complete-data) log-likelihood log p(X, z; θ).  With this, we iteratively find the expected log-likelihood Q and the parameters θ.  Steps  Initialize θ^(0) and do the following while not converged 1) E-step: Q(θ | θ^(t)) = E_{z|X,θ^(t)}[ log p(X, z; θ) ] 2) M-step: θ^(t+1) = argmax_θ Q(θ | θ^(t)) 2019-09-26 213Machine learning and artificial neural network
  • 214. Gaussian mixture model & EM algorithm  EM algorithm for the Gaussian mixture model  Conditional likelihood: p(X, z; θ) = Π_i π_{z_i} N(x_i; μ_{z_i}, C_{z_i})  Steps  Input: {x_i}_{i=1}^N, K  Initialize θ^(0)  Do the following while not converged 1) E-step: γ_{ik}^(t) = π_k^(t) N(x_i; μ_k^(t), C_k^(t)) / Σ_j π_j^(t) N(x_i; μ_j^(t), C_j^(t)) 2) M-step: π_k^(t+1) = (1/N) Σ_i γ_{ik}^(t), μ_k^(t+1) = Σ_i γ_{ik}^(t) x_i / Σ_i γ_{ik}^(t), C_k^(t+1) = Σ_i γ_{ik}^(t) (x_i − μ_k^(t+1))(x_i − μ_k^(t+1))^T / Σ_i γ_{ik}^(t) (See textbook section 13.2 for details) 2019-09-26 214Machine learning and artificial neural network
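A minimal NumPy/SciPy sketch of the E- and M-steps above; initialization, convergence checking and numerical safeguards are simplified, and the variable name gamma (for the responsibilities γ_ik) is our own choice.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM to X (N x d) with plain EM iterations."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                             # mixing weights
    mu = X[rng.choice(N, K, replace=False)]              # means initialized from samples
    C = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k]
        gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], C[k])
                          for k in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update pi, mu, C
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            C[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, C, gamma

# Toy usage; hard cluster labels are gamma.argmax(axis=1)
X = np.random.default_rng(1).standard_normal((300, 2))
pi, mu, C, gamma = gmm_em(X, K=3)
```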
  • 215. Gaussian mixture model & EM algorithm  Clustering with GMM  Note  The number of clusters K must be fixed a priori.  Variational EM can find a good number for K implicitly.  See “C. M. Bishop, Pattern Recognition and Machine Learning, Springer” for variational EM 2019-09-26 215Machine learning and artificial neural network Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
  • 217. Non-linear feature dimension reduction: t-SNE  Data (distribution) visualization  Data visualization gives us a lot of information on the data: the shape of its distribution, the number of separable clusters, and so on.  One can also check whether clustering was done properly and whether there are any outliers.  Linear dimension reduction (PCA) is effective if the number of clusters or the original feature dimension is small enough.  We discuss a non-linear dimension reduction technique, t-distributed stochastic neighbor embedding (t-SNE). 2019-09-26 217Machine learning and artificial neural network
  • 218. Non-linear feature dimension reduction: t-SNE  Requirements in general  Points close to each other in the original space must also be close together in the new (low dimensional) space.  The local structure (manifolds) in the original space is kept in the new space with as little distortion as possible.  Characteristics of t-SNE  It is a non-linear mapping.  Direct mapping: x in the original space  z in the new space, obtained by solving an optimization problem.  If some new data is added, we need to perform the optimization again, and the new mapping will be different from the previous one.  An upgraded version of SNE 2019-09-26 218Machine learning and artificial neural network
  • 219. Non-linear feature dimension reduction: t-SNE  Elements  Pairwise similarity in the original space: p_ij  Pairwise similarity in the new space: q_ij  Cost function: C(Z)  Definition  Given data points {x_i}, let z_i be the point-wise mapping of x_i in the new space.  p_ij = exp(−‖x_i − x_j‖²/2σ²) / Σ_{k≠l} exp(−‖x_k − x_l‖²/2σ²)  q_ij = (1 + ‖z_i − z_j‖²)^(−1) / Σ_{k≠l} (1 + ‖z_k − z_l‖²)^(−1)  Both p_ij and q_ij are valid PMFs. 2019-09-26 219Machine learning and artificial neural network
  • 220. Non-linear feature dimension reduction: t-SNE  Cost function: Kullback-Leibler divergence (KLD)  Cost = KLD between p and q: C(Z) = KL(P‖Q) = Σ_{i≠j} p_ij log(p_ij / q_ij)  C(Z) ≥ 0 and C(Z) = 0 iff q_ij = p_ij hold if p_ij and q_ij are valid PMFs  Optimization  We want to find {z_i} that minimize C(Z).  Apply gradient descent, for which the gradient of C(Z) w.r.t. z_i is given by ∂C(Z)/∂z_i = 4 Σ_j (p_ij − q_ij)(z_i − z_j)(1 + ‖z_i − z_j‖²)^(−1)  More tricks were applied (see the original paper) 2019-09-26 220Machine learning and artificial neural network
  • 221. Non-linear feature dimension reduction: t-SNE  Note  ∂C(Z)/∂z_i = 4 Σ_j (p_ij − q_ij)(z_i − z_j)(1 + ‖z_i − z_j‖²)^(−1)  Let X be the original space and Z be the new space.  The direction of movement of z_i contributed by z_j is either toward z_j or the opposite.  The sign is determined by (p_ij − q_ij), i.e., the movement is toward z_j if p_ij > q_ij (similarity in Z < that in X, or distance in Z > that in X).  The actual movement is given by the sum over all j  makes q_ij and p_ij as close as possible.  (1 + ‖z_i − z_j‖²)^(−1) can be regarded as the rate of movement.  The rate of movement is large if z_i and z_j are close together and vice versa  tries to keep the focus on local structure. 2019-09-26 221Machine learning and artificial neural network
  • 222. Non-linear feature dimension reduction: t-SNE  Comparison: PCA versus t-SNE  400-dimensional features mapped to 2-dimensional features 2019-09-26 222Machine learning and artificial neural network
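A hedged scikit-learn sketch of such a comparison; the digits data set (64-dimensional) is used here as a stand-in for the 400-dimensional features on the slide.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                    # 64-dimensional features
Z_pca = PCA(n_components=2).fit_transform(X)           # linear mapping
Z_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)  # non-linear mapping

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(Z_pca[:, 0], Z_pca[:, 1], c=y, s=5)
axes[0].set_title("PCA")
axes[1].scatter(Z_tsne[:, 0], Z_tsne[:, 1], c=y, s=5)
axes[1].set_title("t-SNE")
plt.show()
```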
  • 223. Non-linear feature dimension reduction: t-SNE  Perplexity: setting σ_i  Perplexity is defined for a point x_i as Perp(P_i) = 2^H(P_i), where H(P_i) = −Σ_j p_{j|i} log₂ p_{j|i} with p_{j|i} = exp(−‖x_i − x_j‖²/2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖²/2σ_i²)  We make the perplexity roughly the same for all i, i.e.,  set σ_i smaller in dense regions (many points nearby)  set σ_i larger in sparse regions (few points nearby)  In this way, the effective number of points nearby is made roughly the same.  Binary search can be used to find σ_i.  Typical values of the perplexity are 5~50 2019-09-26 223Machine learning and artificial neural network
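A minimal NumPy sketch of the binary search for σ_i matching a target perplexity; it assumes dist2 holds the squared distances from x_i to the other points (excluding x_i itself), and the search bounds are arbitrary.

```python
import numpy as np

def sigma_for_perplexity(dist2, target=30.0, n_iter=50):
    """Binary-search sigma_i so that Perp(P_i) = 2**H(P_i) matches the target."""
    lo, hi = 1e-10, 1e10
    sigma = 0.5 * (lo + hi)
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dist2 / (2.0 * sigma ** 2))
        p /= p.sum()                                        # conditional p_{j|i}
        perp = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))     # 2**H(P_i)
        if perp > target:
            hi = sigma                                      # too uniform -> shrink sigma
        else:
            lo = sigma                                      # too peaked -> grow sigma
    return sigma

# Example: squared distances from one point to 99 others
d2 = np.random.default_rng(0).random(99)
print(sigma_for_perplexity(d2, target=30.0))
```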
  • 224.  Practice: ML_practice7_clustering.ipynb 2019-09-26 224Machine learning and artificial neural network Computer Lab.