Machine Learning and Neural Network
Course introduction
Seokhyun Yoon, Electronics Eng., Dankook University
Machine learning: Course introduction
 Target audience
 Senior undergraduate
 First year graduate student
 Prerequisite
 Linear algebra (or Engineering Mathematics 2)
 Basic probability and statistics
 Basic Python programming
 Textbook
 Introduction to Machine Learning and Artificial Neural Networks (기계학습과 인공신경망 개론), Ver 1.xx
 Download: https://www.slideshare.net/SeokhyunYoon1/
2019-09-26 2Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.1: Introduction
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.1 ML Introduction
 Objective: get a “feel” for the field and learn its terminology
1. What is machine learning? Concept and applications
2. What problems can ML solve?
 Classification, regression and clustering
 Supervised and unsupervised learning
3. Key elements of ML
 Data, Model and Cost
4. Design steps and issues in performance evaluation
2019-09-26 4Machine learning and artificial neural network
Machine learning: Introduction
 Major applications
 Pattern classification: Character/Speech recognition
 Object detection and tracking
 Time-series prediction (Stock price/market prediction,
weather forecast)
 Sentence completion and language translation
 … and much more
 Problems in machine learning
 Classification
 Regression
 Clustering
2019-09-26 5Machine learning and artificial neural network
Machine learning: Introduction
Related fields
2019-09-26 6Machine learning and artificial neural network
[Diagram: machine learning and neural networks at the intersection of related fields — probability and statistics, data science (big data, data mining), computer science, artificial intelligence, cognitive science (psychology, neuroscience), and linguistics]
Machine learning: Introduction
 Elements of machine learning in classification and regression problems
 Prediction model f(x; θ) with parameters θ
 Data (observations x and their target values y)
 Cost (loss)/objective function J(θ) to minimize/maximize
 Algorithm to efficiently obtain the optimal or a good
solution
2019-09-26 7Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = f(x; θ)
Cost/loss: J(θ)
Algorithm to solve θ* = argmin_θ J(θ)
Machine learning: Introduction
 Machine learning process
2019-09-26 8Machine learning and artificial neural network
Existing data {(x_i, y_i)}, i = 1, …, N → machine learning algorithm (solves θ* = argmin_θ J(θ)) → model f(x; θ*) → given new data x, prediction ŷ = f(x; θ*)
Machine learning: Introduction
 Classification and regression
2019-09-26 9Machine learning and artificial neural network
[Figure: classification example — an image labeled “cat (smiling)”; regression example — a scatter of (x, y) points with queries x = 2.5 and x = 6.0 to be estimated; existing data vs. new data]
Machine learning: Introduction
 Given an observation x (which can be a vector, a matrix (image) or a tensor)
 Classification determines its class among a set of classes
 Regression estimates/predicts an unobserved variable y
 Regression can be a prediction of a future trend or interpolation of some missing information
 Classification vs. regression
 In classification, y is a discrete, categorical value drawn from a finite set
 In regression, y is a numerical value
2019-09-26 10Machine learning and artificial neural network
Machine learning: Introduction
 Machine learning is all about finding f and θ
 How to find the best or, at least, a good f?
 Given f, how to find the best or, at least, a good θ?
 The best or a good θ for what and in what sense?
 Why do we need pre-collected data for learning/training ?
2019-09-26 11Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = f(x; θ)
Cost/loss: J(θ)
Algorithm to solve θ* = argmin_θ J(θ)
Machine learning: Introduction
 Some terminologies
 Learning/Training/Model fitting: the process of finding the model parameters (θ) that best fit the given data in terms of the predefined cost/objective
 Supervised learning: target values (y) are provided
• Classification, regression
 Unsupervised learning: no target values provided
• Clustering
2019-09-26 12Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = f(x; θ)
Cost/loss: J(θ)
Algorithm to solve θ* = argmin_θ J(θ)
Machine learning: Introduction
 Design steps (supervised learning)
1. Define the function you want to implement (define the input x and the output y)
2. Design your model f(x; θ), intuitively and smartly
3. Collect data and curate them to form the training set
4. Train the model to get θ*
5. Use f(x; θ*) to evaluate the performance
6. If satisfied, you are done! Otherwise, go to step 2 (skip 3).
 Step 2 requires strong/some mathematical background
 Step 3 is typically time-consuming and sometimes requires
domain expertise (e.g. for medical application)
2019-09-26 13Machine learning and artificial neural network
Machine learning: Introduction
 Design steps for beginner (supervised learning)
1. Choose a function you want to implement (input/output
formats are pre-defined)
2. Search for some open SW packages to choose/construct
an appropriate model and try to modify slightly
3. Download a dataset (X, y) from the internet
4. Use the packages to train the model to get θ*
5. Use θ* to evaluate the performance
6. If satisfied, you are done! Otherwise, go to step 2 (skip 3).
2019-09-26 14Machine learning and artificial neural network
Machine learning: Introduction
 Parameters and hyper parameters
 Most of the models have some hyper-parameters that
are pre-defined before training
 Must be optimized for performance, computing costs …
 may need grid search to find the best combination of
hyper parameters.
2019-09-26 15Machine learning and artificial neural network
Machine learning: Introduction
 Performance evaluation of classifier/regressor
 Must consider “generalization error”
 Typical performance measures
 Classification: Accuracy
 Regression: Mean squared error (MSE), R² measure
2019-09-26 16Machine learning and artificial neural network
Figure 1.1: Training and testing of a classifier/estimator
Machine learning: Introduction
 Clustering
 No target values for observations
 Objective is to divide data into a set of groups based on
some similarity measures
 Need to devise procedures to efficiently group data
 Data (distribution) visualization may help
 Once clustered, the data can be used for classification
2019-09-26 17Machine learning and artificial neural network
Machine learning: Introduction
 Two typical similarity measures
 Euclidean distance: d(x, x′) = ‖x − x′‖₂
 Correlation: ρ(x, x′) = xᵀx′ / (‖x‖ ‖x′‖)
 Need to consider symmetry and their ranges
 Note
 L-p norm of a vector: ‖x‖_p = (Σ_m |x_m|^p)^(1/p)
 Default value of p = 2
 Schwartz's inequality: |xᵀx′| ≤ ‖x‖ ‖x′‖
2019-09-26 18Machine learning and artificial neural network
Machine learning: Introduction
 Simplest classifier: k nearest neighbor (knn) classifier
 Training data {(x_i, y_i)}, i = 1, …, N, are used as templates
 Given new input data x, it determines its class as follows
1. Compute d(x, x_i) for all i (may use another similarity measure)
2. Select the k candidates nearest to x
3. Use a majority vote to determine the class of x
2019-09-26 19Machine learning and artificial neural network
Existing data {(x_i, y_i)}, i = 1, …, N → kNN classifier ← new data x → prediction ŷ
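As a concrete illustration of these three steps, here is a minimal NumPy sketch of the k-NN decision rule; the toy data, the choice k = 3, and the use of Euclidean distance are illustrative assumptions, not part of the slides.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Step 1: Euclidean distance from x_new to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k nearest candidates
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example (assumed data, two classes in 2-D)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```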
Machine learning: Introduction
 k nearest neighbor (knn) as regressor
 Training data {(x_i, y_i)}, i = 1, …, N, are used as templates
 Given new input data x, it estimates its target value as follows
1. Compute d(x, x_i) for all i (may use another similarity measure)
2. Select the k candidates nearest to x
3. Take the average over the k candidates' target values to determine the estimate
2019-09-26 20Machine learning and artificial neural network
Existing data {(x_i, y_i)}, i = 1, …, N → kNN regressor ← new data x → prediction ŷ
Machine Learning and Neural Network
Ch.2: Data and descriptive statistics
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.2 Data and descriptive statistics
 Topics
1. Data: types and representation
2. Descriptive statistics
 Scatter plot and histogram
 Mean, correlation and covariance
2019-09-26 22Machine learning and artificial neural network
Data and descriptive stat.
 Terminologies and notation
 Observation/sample/feature vector x (for now, assume that it is a vector)
 Target value y: the desired value for a sample
 In supervised learning, x and y should be paired (x_i, y_i)
 Collection of data: X = [x₁, x₂, …, x_N]
2019-09-26 23Machine learning and artificial neural network
Each column is a sample
each row is a feature
Data and descriptive stat.
 Two types of data:
 Categorical
 Numerical
 Categorical value is typically mapped to an integer
to make it suitable for computation
 ex: T → 1, F → 0
 Blood type: O → 0, A → 1, B → 2, AB → 3
2019-09-26 24Machine learning and artificial neural network
Data and descriptive stat.
 An example of multivariate data
 A dataset consisting of 20 samples
 Each column is one sample with 4 features (Group, English, Math, Science score) → call it a feature vector
 where Group is categorical and the others are numerical
2019-09-26 25Machine learning and artificial neural network
sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD.
Group A A A A A A A A A A B B B B B B B B B B
English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09
Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09
Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
a sample/observation
Data and descriptive stat.
 Example problems
 Classification:
Given x, determine y
 Regression:
Given x, estimate y
2019-09-26 26Machine learning and artificial neural network
sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD.
Group A A A A A A A A A A B B B B B B B B B B
English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09
Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09
Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
a sample/observation
Data and descriptive stat.
 Data visualization: scatter plot and histogram
2019-09-26 27Machine learning and artificial neural network
 Empirical (Probability)
density gives us lots of
information for the
design and performance
of classifier, regressor
and clustering algorithm
 One or two dimensional
(bivariate) data is easy
to visualize
 While, more than 2D is
hard
 Pairwise scatter plot is
affordable for small M
Data and descriptive stat.
 Problems in machine learning
2019-09-26 28Machine learning and artificial neural network
[Figure: example datasets for classification, regression, and clustering]
Few probability-distribution models can be successfully applied to practical datasets.
That is why we resort to machine learning based on a collection of samples.
Data and descriptive stat.
 Mean, Correlation and Covariance
 Consider a dataset X of N samples and M features
 (Per-feature) mean: m_m = (1/N) Σ_i x_{m,i}
 (Per-feature) variance: σ_m² = (1/N) Σ_i (x_{m,i} − m_m)²
 σ_m is the standard deviation
 The m_m's and σ_m²'s can be collectively represented as vectors
2019-09-26 29Machine learning and artificial neural network
Data and descriptive stat.
 Mean, Correlation and Covariance
 Dataset X of N samples and M features
 Correlation (for a pair of features): r_{mn} = (1/N) Σ_i x_{m,i} x_{n,i}
 Covariance (for a pair of features): c_{mn} = r_{mn} − m_m m_n
 r_{mn} = r_{nm}, c_{mn} = c_{nm} (symmetric)
 The r_{mn}'s and c_{mn}'s can be collectively represented as matrices
2019-09-26 30Machine learning and artificial neural network
Data and descriptive stat.
 Mean, Correlation matrix and Covariance matrix
 Consider a dataset X of N samples and M features (each column of X is a sample)
 Mean (vector): m_X = (1/N) Σ_i x_i
 Correlation matrix: R_XX = (1/N) X Xᵀ
 Covariance matrix: C_XX = R_XX − m_X m_Xᵀ
 Cross correlation: r_Xy = (1/N) X y
 Cross covariance: c_Xy = r_Xy − m_X m_y
2019-09-26 31Machine learning and artificial neural network
(Sizes: R_XX and C_XX are M×M; r_Xy and c_Xy are M×1)
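The matrix definitions above translate directly into NumPy. Below is a minimal sketch assuming, as on the earlier slide, that each column of X is a sample and each row a feature; the toy numbers are made up for illustration.

```python
import numpy as np

# Assumed toy data: M = 3 features (rows), N = 5 samples (columns)
X = np.array([[77., 81., 74., 89., 78.],
              [72., 67., 74., 64., 71.],
              [75., 68., 72., 68., 74.]])
N = X.shape[1]

m_X = X.mean(axis=1, keepdims=True)   # mean vector, M x 1
R_XX = (X @ X.T) / N                  # correlation matrix, M x M
C_XX = R_XX - m_X @ m_X.T             # covariance matrix, M x M

# C_XX is symmetric and non-negative definite: its eigenvalues are >= 0
eigvals = np.linalg.eigvalsh(C_XX)
print(m_X.ravel(), eigvals)
```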
Data and descriptive stat.
 Properties of R_XX (and C_XX)
 R_XXᵀ = R_XX (symmetric)
 R_XX is non-negative definite, such that, for any vector v, vᵀ R_XX v ≥ 0
 The eigenvalues are all non-negative and their eigenvectors form an orthonormal basis, i.e., with the eigen-decomposition R_XX = E Λ Eᵀ, the diagonal elements of Λ are all non-negative real and Eᵀ E = I
 det(R_XX) equals the product of the eigenvalues (hence ≥ 0)
 If N < M (the number of samples is less than the number of features), then R_XX has at most N non-zero eigenvalues (all others are zero). In this case, R_XX is not invertible
 These properties also hold for C_XX
2019-09-26 32Machine learning and artificial neural network
Data and descriptive stat.
 For the two given data matrices X₁ and X₂,
 Find m_X
 Find R_XX and C_XX
 Check whether R_XX and C_XX satisfy the properties in the previous slide.
2019-09-26 33Machine learning and artificial neural network
Data and descriptive stat.
 Example (Problem 2.2)
 Find the correlation and covariance between
• English and Math
• English and Science
• Math and Science
 Find R_XX and C_XX
 Check whether R_XX and C_XX satisfy the properties in the previous slide.
2019-09-26 34Machine learning and artificial neural network
sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD.
English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09
Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09
Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
Homework & Computer Lab.
 Homework: 2.1, 2.2
2019-09-26 35Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.3: Multi-variate Gaussian PDF
and linear transform
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.3 Multivariate Gaussian PDF & linear transform
 Topics
1. Multi-variate Gaussian PDF
 Pearson’s correlation coefficient
2. Linear transformation
 Principal axes transform and whitening
3. Principal component analysis (PCA)
2019-09-26 37Machine learning and artificial neural network
Multivariate Gaussian PDF: definition
 Definition of the multivariate Gaussian (Normal) PDF
 Consider a Gaussian random vector x = [x₁, x₂, …, x_M]ᵀ
 The PDF of x is defined, in general, as
  p(x) = (2π)^(−M/2) |C|^(−1/2) exp(−½ (x − μ)ᵀ C⁻¹ (x − μ))
 where μ is the mean and C is the covariance matrix
 Note
 The quadratic form (x − μ)ᵀ C⁻¹ (x − μ) is a scalar (C⁻¹ is M×M)
 Mahalanobis distance: d(x, μ) = ((x − μ)ᵀ C⁻¹ (x − μ))^(1/2) (symmetric)
2019-09-26 38Machine learning and artificial neural network
Multivariate Gaussian PDF
 3 cases of the bivariate Gaussian (Normal) PDF
 Case 1: μ = [0, 5]ᵀ, C = [[9, 0], [0, 9]]
 Case 2: μ = [0, 5]ᵀ, C = [[1, 0], [0, 16]]
 Case 3: μ = [0, 5]ᵀ, C = [[9, −10], [−10, 16]]
2019-09-26 39Machine learning and artificial neural network
Mean is just a
“translation”
Contour plot
Multivariate Gaussian PDF
 Let's take a closer look
 A “contour” can be obtained from (x − μ)ᵀ C⁻¹ (x − μ) = c (a constant)
 Suppose that μ = 0 for simplicity → xᵀ C⁻¹ x = c
 Suppose also that M = 2 (bivariate), with C = [[σ₁², c₁₂], [c₁₂, σ₂²]]
 Then, we have
  (z₁/σ₁)² − 2ρ (z₁/σ₁)(z₂/σ₂) + (z₂/σ₂)² = c′(1 − ρ²)
 where ρ is the Pearson correlation coefficient defined as ρ = c₁₂/(σ₁σ₂), satisfying −1 ≤ ρ ≤ 1
 We say
 z₁ and z₂ are uncorrelated if ρ = 0
 and have perfect correlation if ρ = ±1
2019-09-26 40Machine learning and artificial neural network
This is an ellipse
Multivariate Gaussian PDF
 Examples
 The Pearson correlation coefficient between two random variables (two features) x_m and x_n is defined as ρ_{mn} = c_{mn}/(σ_m σ_n), satisfying −1 ≤ ρ_{mn} ≤ 1
 We say that
 x_m and x_n are uncorrelated if ρ_{mn} = 0
 and have perfect correlation if ρ_{mn} = ±1
2019-09-26 41Machine learning and artificial neural network
Data and descriptive stat.
 What can you see?
2019-09-26 42Machine learning and artificial neural network
 Are Math and English
scores correlated ?
 What can you say
about Math and English
score? Set up your
hypothesis.
 Use the figure in the
previous page to
roughly estimate the
Pearson correlation
coefficient.
Multivariate Gaussian PDF (supplementary)
 Marginalization of an M-variate Gaussian PDF is also a Gaussian PDF with (M−1) variates
  p(x₁, …, x_{i−1}, x_{i+1}, …, x_M) = ∫ p(x) dx_i
 Successive marginalization gives us a univariate Gaussian PDF
  p(x_m) = N(x_m; μ_m, σ_m²)
2019-09-26 43Machine learning and artificial neural network
Linear transform
 Definition of a linear transformation
 For any matrix A of size (K×M), the linear transform of a vector x of size (M×1) is defined as y = A x
 A linear transform is a projection of x onto the row space of A
 Linear transform of a Gaussian random vector
 Suppose that x is a Gaussian RV with mean μ and cov. C, i.e., x ~ N(μ, C)
 Then, for any matrix A, the linear transform y = A x is also Gaussian with mean Aμ and covariance A C Aᵀ, i.e., y ~ N(Aμ, A C Aᵀ)
 Try to verify this using the definitions of mean and covariance in Ch.2
2019-09-26 44Machine learning and artificial neural network
Linear transform
 Principal axes transformation and whitening
 Suppose that C = E Λ Eᵀ (eigen-decomposition of C), where
  Λ: diagonal matrix with λ_m (the m-th eigenvalue)
  E: eigen basis (the m-th column is the eigenvector for λ_m)
 (Principal axes transform) The linear transform y = Eᵀ(x − μ), using Eᵀ as the transform matrix, is Gaussian with PDF N(y; 0, Λ)
 (Whitening) By using Λ^(−1/2) Eᵀ as the transform matrix, y = Λ^(−1/2) Eᵀ (x − μ) is also Gaussian with PDF N(y; 0, I)
2019-09-26 45Machine learning and artificial neural network
Principal Component Analysis (PCA)
 Principal component analysis (PCA)
 With C = E Λ Eᵀ,
 PCA uses several (typically two) eigenvectors corresponding to the largest eigenvalues as the projection matrix.
 Let
• (λ₁, λ₂) be the two largest eigenvalues
• (e₁, e₂) be the corresponding eigenvectors
 We use E₂ = [e₁ e₂] as the transform matrix: z = E₂ᵀ(x − μ)
 The distribution of z can be easily visualized in a low-dimensional (e.g., 2-D) space.
 If (λ₁ + λ₂)/tr(C) ≈ 1, z contains most of the information on x
2019-09-26 46Machine learning and artificial neural network
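A minimal NumPy sketch of the PCA projection described above (eigen-decomposition of the covariance matrix, then projection onto the two leading eigenvectors); the random toy data and the helper name pca_2d are assumptions for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project M-dimensional samples (columns of X) onto the two principal axes."""
    N = X.shape[1]
    m = X.mean(axis=1, keepdims=True)
    C = (X - m) @ (X - m).T / N            # covariance matrix, M x M
    eigvals, E = np.linalg.eigh(C)          # eigh: ascending eigenvalues, orthonormal E
    idx = np.argsort(eigvals)[::-1][:2]     # indices of the two largest eigenvalues
    E2 = E[:, idx]                          # M x 2 projection matrix
    return E2.T @ (X - m)                   # 2 x N low-dimensional representation

# Assumed toy data: 4 features, 6 samples
X = np.random.default_rng(0).normal(size=(4, 6))
Z = pca_2d(X)
print(Z.shape)   # (2, 6)
```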
Data (distribution) visualization
 Pairwise scatter plot is NOT affordable for large M
2019-09-26 47Machine learning and artificial neural network
M = 4 M = 64 (showing only 10 features)
Data (distribution) visualization
2019-09-26 48Machine learning and artificial neural network
Pair-wise scatter plots of Iris dataset
(3 classes, 4 dimensional feature)
A 2-dimensional projection provides a better representation of the clusters and of the similarity between features
Data (distribution) visualization
2019-09-26 49Machine learning and artificial neural network
Pair-wise scatter plots of Digits dataset
(10 classes, 64 dimensional feature)
Showing only first 10x10
A 2-dimensional projection provides a better representation of the clusters and of the similarity between features
Homework & Computer Lab.
 Homework: 3.1~3.6
 Practice:
ML_practice0_ch3_data_visualization_190817c.ipynb
2019-09-26 50Machine learning and artificial neural network
Machine Learning and Neural Network
Appendix A: Optimization I
Seokhyun Yoon, Electronics Eng., Dankook University
Appendix: Optimization
 Topics
1. Optimization I: Unconstrained optimization
 Definition of optimization problem
 Quadratic programming problem
 Maximum likelihood estimation as an optimization problem
2. Optimization II: Iterative solutions
 Gradient descent and stochastic gradient descent
 Coordinate descent
 Newton-Raphson method
3. Optimization III: Constrained optimization
 Definition
 Lagrange multiplier and Rayleigh quotient optimization
 Duality in constrained optimization and KKT condition
2019-09-26 52Machine learning and artificial neural network
Unconstrained optimization
 Definitions of unconstrained optimization
 Minimization: min_{θ∈ℝᴹ} J(θ), or θ* = argmin_θ J(θ)
 Maximization: max_{θ∈ℝᴹ} J(θ), or θ* = argmax_θ J(θ)
where J(θ) is a cost/objective function.
 Convex optimization
 If J(θ) is a convex function, the solution can be obtained by solving ∇_θ J(θ) = 0 (as there is only one minimum (maximum))
where ∇_θ is the gradient operator
2019-09-26 53Machine learning and artificial neural network
Unconstrained optimization: QP problem
 Quadratic programming (QP) problem
 A QP problem is a special case of a convex optimization problem
 J(θ) is a quadratic function of θ, e.g.,
  J(θ) = ½ θᵀAθ − bᵀθ + c (with A non-negative definite)
 Since J(θ) is a convex function, the solution is given by solving ∇_θ J(θ) = Aθ − b = 0
 Solution: θ* = A⁻¹b (if A is invertible)
2019-09-26 54Machine learning and artificial neural network
Unconstrained optimization: Gradient formula
 Gradient operators
 For a vector θ: ∇_θ = [∂/∂θ₁, ∂/∂θ₂, …, ∂/∂θ_M]ᵀ
 For a matrix A: ∇_A = [∂/∂a_{ij}] (same size as A)
 Gradient formulas
 ∇_θ (bᵀθ) = ∇_θ (θᵀb) = b
 ∇_θ (θᵀAθ) = (A + Aᵀ)θ
 ∇_A (bᵀAc) = b cᵀ
 ∇_A log|A| = (A⁻¹)ᵀ
2019-09-26 55Machine learning and artificial neural network
Unconstrained optimization: Gradient formula
 Example (Problem A.1):
 Minimize J(θ₁, θ₂), i.e., find (θ₁*, θ₂*) that minimizes J, and find also the minimum value J(θ₁*, θ₂*)
 Express J in vector-matrix form, i.e., J(θ) = ½θᵀAθ − bᵀθ + c
 Use the vector-matrix form to minimize J (use the gradient formulas)
 Repeat for the second cost function given in the problem
2019-09-26 56Machine learning and artificial neural network
Maximum likelihood estimation
 Given
 Data samples: {x₁, x₂, …, x_N}
 PDF model: p(x; θ) with unknown parameter θ
 We want to find θ that maximizes
 the likelihood of the data: L(θ) = Π_i p(x_i; θ)
 or the log-likelihood: log L(θ) = Σ_i log p(x_i; θ)
 It is a maximization problem
  θ* = argmax_{θ∈ℝᴹ} L(θ) = argmax_{θ∈ℝᴹ} log L(θ)
2019-09-26 57Machine learning and artificial neural network
MLE example: Bernoulli trial
 Given
 Data samples: {x₁, …, x_N}, where x_i ∈ {0, 1}
 PDF model: p(x; θ) = θˣ(1 − θ)^(1−x) with 0 ≤ θ ≤ 1
 Parameter to estimate: θ
 Likelihood function: L(θ) = Π_i θ^{x_i}(1 − θ)^{1−x_i} = θᵏ(1 − θ)^{N−k}
 Solution: θ* = k/N
2019-09-26 58Machine learning and artificial neural network
Try to verify this by maximizing
the likelihood or log-likelihood function,
where k is the number of 1's
that occurred in the N trials
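A quick numerical check of the θ* = k/N result, evaluating the log-likelihood on a grid; the sample values below are assumed for illustration.

```python
import numpy as np

# Assumed Bernoulli samples; the MLE should equal k/N, the fraction of 1's
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
k, N = x.sum(), len(x)

# Log-likelihood: log L(theta) = k*log(theta) + (N - k)*log(1 - theta)
thetas = np.linspace(0.01, 0.99, 999)
loglik = k * np.log(thetas) + (N - k) * np.log(1 - thetas)
print(thetas[np.argmax(loglik)], k / N)   # both ~ 0.7
```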
MLE example: Multi-variate Gaussian PDF (optional)
 Given
 Data samples: {x₁, …, x_N}, where x_i ∈ ℝᴹ
 PDF model: p(x; μ, C) = (2π)^(−M/2)|C|^(−1/2) exp(−½(x − μ)ᵀC⁻¹(x − μ))
 where μ: mean, C: covariance matrix → the parameters to estimate
 Log-likelihood function
  log L(μ, C) = −(N/2) log|C| − ½ Σ_i (x_i − μ)ᵀC⁻¹(x_i − μ) + const.
 Solution:
  μ* = (1/N) Σ_{i=1}^{N} x_i
  C* = (1/N) Σ_{i=1}^{N} (x_i − μ*)(x_i − μ*)ᵀ
2019-09-26 59Machine learning and artificial neural network
Try to verify this using
gradient formula.
Seokhyun Yoon, Electronics Eng., Dankook University
Machine Learning and Neural Network
Ch.4: Regression
Roadmap
2019-09-26 61Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
Ch.4 Regression
 Topics
1. Linear regression
2. Vector-matrix representation of linear regression
3. Linear prediction
4. Non-linear regression and overfit
5. Performance evaluation: cross-validation
2019-09-26 62Machine learning and artificial neural network
Regression
 Elements of regression problem
 Prediction model f(x; θ) with parameters θ
 Data (observations x and their target values y)
 Cost (loss)/objective function J(θ) to minimize/maximize
 Algorithm to efficiently obtain the optimal or a good solution
2019-09-26 63Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = f(x; θ)
Cost/loss: J(θ)
Algorithm to solve θ* = argmin_θ J(θ)
Regression: Linear regression
 A simple example of linear regression
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i, y_i ∈ ℝ
 Model: ŷ = θ₁x + θ₀, where the parameters are θ = (θ₀, θ₁)
 The problem is to find the best θ for the given data
 Best in what sense?
[Figure: scatter plot of the points (x_i, y_i) in the (x, y) plane with a fitted line]
Regression: Linear regression
 Least squares solution (least squares method)
 We want to minimize the residual sum of squares (RSS)
 Define the error: e_i = y_i − (θ₁x_i + θ₀)
 Minimize: J(θ) = Σ_i e_i² = Σ_i (y_i − θ₁x_i − θ₀)²
 where J(θ) is a quadratic (convex) function of θ₀ and θ₁
 Can use ∇_θ J(θ) = 0 to find θ₀ and θ₁ in terms of the data
2019-09-26 65Machine learning and artificial neural network
Regression: Linear regression
 Generalization to multi-variate data
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ ℝ
 Model: ŷ = θ₁x₁ + θ₂x₂ + … + θ_M x_M + θ₀
 where the parameters are θ = (θ₀, θ₁, …, θ_M)
 Cost function: residual sum of squares (RSS)
 where e_i = y_i − ŷ_i
  J(θ) = Σ_i e_i²
 The problem is to find θ* = argmin_{θ∈ℝ^{M+1}} J(θ)
2019-09-26 66Machine learning and artificial neural network
Regression: Model structure
 Model and its training at a glance
2019-09-26 67Machine learning and artificial neural network
Regression: Linear regression
 Solution
 J(θ) is a quadratic function of the θ_m's (a convex function)
 Can use ∇_θ J(θ) = 0 to obtain a system of linear equations
 Then, solve the system of equations to get θ*
 Equivalently, in vector-matrix form, (X̃ X̃ᵀ) θ = X̃ y
 where θ = [θ₀, θ₁, …, θ_M]ᵀ, X̃ X̃ᵀ = Σ_i x̃_i x̃_iᵀ, and X̃ y = Σ_i y_i x̃_i (x̃_i = [1, x_iᵀ]ᵀ)
2019-09-26 68Machine learning and artificial neural network
Regression: Vector matrix notation
 Vector-matrix notation
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ ℝ
 Model: ŷ = θᵀx̃
 where x̃ = [1, x₁, …, x_M]ᵀ, θ = [θ₀, θ₁, …, θ_M]ᵀ
 Cost function: residual sum of squares (RSS)
 Error vector: e = y − X̃ᵀθ
  J(θ) = ‖e‖² = (y − X̃ᵀθ)ᵀ(y − X̃ᵀθ)
  ∇_θ J(θ) = −2 X̃ y + 2 X̃ X̃ᵀ θ
where X̃ = [x̃₁, x̃₂, …, x̃_N] is the (M+1)×N data matrix whose first row is all ones (each column is a 1-augmented sample)
Regression: Vector matrix notation
 Vector-matrix notation
 The problem is to find the solution of ∇_θ J(θ) = 0, which is
  ∇_θ J(θ) = −2 X̃ y + 2 X̃ X̃ᵀ θ = 0 → (X̃ X̃ᵀ) θ = X̃ y
 Solution: θ* = (X̃ X̃ᵀ)⁻¹ X̃ y
 A unique solution exists only if X̃ X̃ᵀ is invertible!
2019-09-26 70Machine learning and artificial neural network
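A minimal NumPy sketch of the closed-form solution θ* = (X̃ X̃ᵀ)⁻¹ X̃ y with the column-sample convention used above; the toy data are assumed for illustration.

```python
import numpy as np

# Assumed toy data: M = 2 features, N = 6 samples (columns)
X = np.array([[1., 2., 3., 4., 5., 6.],
              [0., 1., 0., 1., 0., 1.]])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.8])

Xt = np.vstack([np.ones(X.shape[1]), X])     # 1-augmented data matrix, (M+1) x N
# Solve the normal equations (X~ X~^T) theta = X~ y
theta = np.linalg.solve(Xt @ Xt.T, Xt @ y)
y_hat = theta @ Xt                           # predictions for the training samples
print(theta)
```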
Regression: Linear regression example
 Example
 We want to estimate the English score using two models
  English = θ₁ · Math + θ₀
  English = θ₁ · Math + θ₂ · Science + θ₀
 Find (θ₀, θ₁) and (θ₀, θ₁, θ₂), respectively. (You may use the results of Problem 2.2)
 Homework: finish Problems 4.1 and 4.2
2019-09-26 71Machine learning and artificial neural network
sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD.
English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09
Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09
Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
Regression: Linear prediction
 Linear prediction
 Given time-series data x(0), x(1), …, x(T−1)
 Use the p previous samples to predict the next sample, i.e., we want to predict x(t) using x(t−1), …, x(t−p)
 Model: x̂(t) = θ₁x(t−1) + θ₂x(t−2) + … + θ_p x(t−p)
 Example 4.3
2019-09-26 72Machine learning and artificial neural network
𝑡 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
𝑥 -4 -3 14 8 1 -5 -7 -4 -2 6 10 22 15 -15 -20 ?
Regression: Linear prediction
 Linear prediction
 Target value: y = [x(p), x(p+1), …, x(T−1)]ᵀ
 Data matrix: X = [x_p, x_{p+1}, …, x_{T−1}], where x_t = [x(t−1), x(t−2), …, x(t−p)]ᵀ
 Model: x̂(t) = Σ_{k=1}^{p} θ_k x(t−k) (no intercept)
 Solution: θ* = argmin_θ J(θ) = R_XX⁻¹ r_Xy
 Prediction: x̂(T) = Σ_{k=1}^{p} θ_k* x(T−k)
 Note: R_XX is a Toeplitz matrix
2019-09-26 73Machine learning and artificial neural network
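A minimal sketch of this linear predictor using the time series from the table; the prediction order p = 2 is an illustrative assumption (the homework asks for other orders as well), and for convenience the rows of X here are samples.

```python
import numpy as np

# Time series from the slide; the last value (t = 15) is to be predicted
x = np.array([-4, -3, 14, 8, 1, -5, -7, -4, -2, 6, 10, 22, 15, -15, -20], dtype=float)
p = 2                                  # prediction order (assumed for illustration)

# Build target vector y and data matrix X from the p previous samples (no intercept)
y = x[p:]                                                              # x(p), ..., x(T-1)
X = np.column_stack([x[p - k: len(x) - k] for k in range(1, p + 1)])   # columns: x(t-1), ..., x(t-p)

theta = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares predictor coefficients
x_next = x[-1:-p - 1:-1] @ theta           # predict x(15) from x(14), x(13)
print(theta, x_next)
```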
Regression: Linear prediction
 Homework: Example 4.3
1) For the given prediction order p, write y and X, and find R_XX and r_Xy.
2) Find the linear predictor parameters θ* and use them to predict x(15).
3) Find the mean squared error (1/N) Σ_t (x(t) − x̂(t))². (N = 14)
4) Repeat (1)–(3) for the other prediction order(s).
5) Find the variance of the time-series data and compare it with the mean squared error for each prediction order.
6) Briefly compare and discuss the results of (5).
2019-09-26 74Machine learning and artificial neural network
𝑡 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
𝑥 -4 -3 14 8 1 -5 -7 -4 -2 6 10 22 15 -15 -20 ?
Regression: Non-linear model and overfit
 Example of Non-linear regression
 Two-feature data x = (x₁, x₂)
 Non-linear model: ŷ = f(x; θ), e.g., a polynomial in x₁ and x₂ such as ŷ = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂
 Defining the augmented feature vector x̃ = [1, x₁, x₂, x₁², x₂², x₁x₂]ᵀ, the RSS cost gives us θ* = (X̃ X̃ᵀ)⁻¹ X̃ y
 Note
 The model is non-linear in the x's, but linear in the θ's
 The RSS cost function still gives us a linear system of equations
2019-09-26 75Machine learning and artificial neural network
Regression: Non-linear model and overfit
 Considerations for non-linear regression
 If the model is a non-linear function of the θ's, the problem (finding the solution) becomes complicated.
 A non-linear model is subject to overfit (large generalization error), especially when the number of samples is relatively small compared to the number of parameters in the model.
 We need to check whether the model is overfitted to the data or not.
2019-09-26 76Machine learning and artificial neural network
Source: https://slideplayer.com/slide/6825533/
Regression: Non-linear model and overfit
 Overfit, underfit, and appropriate (just-right) fit
2019-09-26 77Machine learning and artificial neural network
source: https://slideplayer.com/slide/6825533/
source : https://towardsdatascience.com/underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6fe4a8a49dbf
Regression: Non-linear model and overfit
 How to check if the model is overfitted or not
 If the model is overfitted, the generalization error is much (?) larger than the minimized cost for the training data, i.e.,
  J_test(θ*) ≫ J_train(θ*)
 where θ* was obtained based on the training data
 That's why we divide the data (samples) into training and test sets for performance evaluation
 A more systematic approach to test overfit: cross validation
2019-09-26 78Machine learning and artificial neural network
Regression: Non-linear model and overfit
 L-fold cross-validation
1. Divide the entire data (of N samples) into L groups (of N/L samples per group)
2. Select one group for test and use all the others for training
3. Measure J_train(θ*) and J_test(θ*)
4. Repeat 2 and 3 for each group and take the average of both measures
5. Check whether J_test(θ*) ≫ J_train(θ*)
2019-09-26 79Machine learning and artificial neural network
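A minimal sketch of L-fold cross-validation following steps 1–5; the linear model, the fold count L = 5, and the toy data are illustrative assumptions.

```python
import numpy as np

def l_fold_cv(X, y, fit, predict, L=5):
    """L-fold cross-validation: average training and test MSE over the L splits."""
    N = len(y)
    idx = np.random.permutation(N)
    folds = np.array_split(idx, L)
    mse_train, mse_test = [], []
    for k in range(L):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(L) if j != k])
        theta = fit(X[train], y[train])
        mse_train.append(np.mean((y[train] - predict(X[train], theta)) ** 2))
        mse_test.append(np.mean((y[test] - predict(X[test], theta)) ** 2))
    return np.mean(mse_train), np.mean(mse_test)

# Example with a linear model (rows of X are samples here, for convenience)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda X, theta: X @ theta
X = np.random.default_rng(1).normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.default_rng(2).normal(size=40)
print(l_fold_cv(X, y, fit, predict, L=5))   # test MSE close to train MSE -> no overfit
```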
Homework & Computer Lab.
 Homework: 4.1, 4.2, 4.3
 Computer Lab: ML_practice1_regression_ex_190820.ipynb
2019-09-26 80Machine learning and artificial neural network
Seokhyun Yoon, Electronics Eng., Dankook University
Machine Learning and Neural Network
Ch.5: Regularization
Roadmap
2019-09-26 82Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
Ch.5 Regularization
 Topics
1. Ridge regression
2. LASSO regression
3. Elastic-net
2019-09-26 83Machine learning and artificial neural network
Regularization
 Recall linear regression
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ ℝ
 Model: ŷ = θᵀx̃
 where x̃ = [1, x₁, …, x_M]ᵀ, θ = [θ₀, θ₁, …, θ_M]ᵀ
 Cost function: residual sum of squares (RSS)
  J(θ) = ‖e‖², where e = y − X̃ᵀθ
  ∇_θ J(θ) = −2 X̃ y + 2 X̃ X̃ᵀ θ
 The problem is to find the solution of ∇_θ J(θ) = 0, which is
  (X̃ X̃ᵀ) θ = X̃ y → θ* = (X̃ X̃ᵀ)⁻¹ X̃ y
 A unique solution exists if X̃ X̃ᵀ is invertible! What if it is NOT?
2019-09-26 84Machine learning and artificial neural network
Regularization
 In what cases is X̃ X̃ᵀ NOT invertible?
 It is not if N < M, i.e., when the number of samples is less than the number of features (e.g., as in bioinformatics or medical applications)
 An infinite number of solutions exists
 The model parameters and performance can be highly variable with small changes in the data (overfit)
 Two possible approaches
 Increasing sample size (noise injection)
 Reducing feature dimension (selecting good features)
2019-09-26 85Machine learning and artificial neural network
Regularization
 Increasing sample size (noise injection)
 One can double the number of samples by generating a new set of data X′ = X + Z, where Z is a random noise matrix with covariance σ²I, i.e., z_i ~ N(0, σ²I)
 Then, use [X, X′] as the new data
 Note that [X̃, X̃′][X̃, X̃′]ᵀ = X̃ X̃ᵀ + X̃′ X̃′ᵀ, which is now invertible “anyway” if 2N > M
 It is effectively a “noise injection”
 → the generalization error can be reduced to some extent
 If needed, one can add more copies with different random noise.
 The noise variance must be chosen carefully.
 Note: the distribution of the noise-augmented data may not model well the true distribution of x.
2019-09-26 86Machine learning and artificial neural network
Regularization
 Reducing feature dimension (selecting features)
 One can select M’ (<N) features, for example, having
highest covariance with target value y.
 However, this does not guarantee a better performance.
 An efficient feature selection method (LASSO) will be
discussed shortly
2019-09-26 87Machine learning and artificial neural network
Regularization: Ridge and LASSO
 Ridge and LASSO regression: RSS + L1/L2 Penalty
 Ridge: J(θ) = ‖y − X̃ᵀθ‖² + λ‖θ‖₂²
 LASSO: J(θ) = ‖y − X̃ᵀθ‖² + λ‖θ‖₁
 Lp-norm: ‖θ‖_p = (Σ_m |θ_m|^p)^(1/p)
 λ controls the relative weight between the RSS and the penalty
 Elastic net: RSS + L1 + L2 penalty
 J(θ) = ‖y − X̃ᵀθ‖² + λ₁‖θ‖₁ + λ₂‖θ‖₂²
2019-09-26 88Machine learning and artificial neural network
Regularization: What is the impact of penalty?
 Ridge regression
 Ridge regression is simply a QP problem
 And the solution is θ* = (X̃ X̃ᵀ + λI)⁻¹ X̃ y
 X̃ X̃ᵀ + λI is invertible for λ > 0, even if X̃ X̃ᵀ is not (see Problem 6.3)
 It is effectively a “noise injection” (an increase of sample size)
 and the generalization error can be reduced to some extent
2019-09-26 89Machine learning and artificial neural network
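A minimal NumPy sketch of the closed-form ridge solution; for convenience the rows of X here are samples, and the data and λ are assumed for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution theta* = (X^T X + lam*I)^(-1) X^T y (rows = samples)."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# Works even when N < M, where plain least squares has no unique solution
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))     # N = 10 samples, M = 50 features
y = rng.normal(size=10)
theta = ridge_fit(X, y, lam=0.5)
print(theta.shape)                # (50,)
```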
Regularization: What is the impact of penalty?
 LASSO regression
 LASSO stands for Least
Absolute Shrinkage and
Selection Operator
 It tends to select the features that describe the target value y well
 Some θ_m's vanish if the corresponding features don't have a strong correlation with y
 LASSO effectively reduces M, rather than increasing N.
2019-09-26 90Machine learning and artificial neural network
Regularization: What is the impact of penalty?
 Further remarks on LASSO regression
 λ controls sparsity (a higher λ selects fewer features)
 LASSO tends to select one feature from a group of highly correlated variables (features) and ignore the rest.
 Unlike the L2 penalty, the L1 penalty is not differentiable at θ_m = 0
 LASSO regression is a convex optimization problem, while it is NOT a simple QP problem
 → use an iterative algorithm to find the solution, especially when M > N (the coordinate descent algorithm, discussed next)
 See the textbook for the coordinate descent algorithm for LASSO
2019-09-26 91Machine learning and artificial neural network
Regularization: Elastic-net
 Elastic-net
 Elastic-net combines L1 and L2 penalty
 The L1 penalty selects features (generating a sparse model)
 The L2 penalty reduces the generalization error and also encourages grouping effects.
2019-09-26 92Machine learning and artificial neural network
Homework & Computer Lab.
 Homework: 6.2, 6.3
 Computer lab: ML_practice1_regression_ex_190820.ipynb
Machine Learning and Neural Network
Appendix C: Optimization III
Seokhyun Yoon, Electronics Eng., Dankook University
Appendix: Optimization
 Topics
1. Optimization I: Unconstrained optimization
 Definition of optimization problem
 Quadratic programming problem
 Maximum likelihood estimation as an optimization problem
2. Optimization II: Iterative solutions
 Gradient descent and stochastic gradient descent
 Coordinate descent
 Newton-Raphson method
3. Optimization III: Constrained optimization
 Definition
 Lagrange multiplier and Rayleigh quotient optimization
 Duality in constrained optimization and KKT condition
2019-09-26 94Machine learning and artificial neural network
Unconstrained optimization
 Definitions of unconstrained optimization
 Minimization: min_{θ∈ℝᴹ} J(θ), or θ* = argmin_θ J(θ)
 Maximization: max_{θ∈ℝᴹ} J(θ), or θ* = argmax_θ J(θ)
where J(θ) is a cost/objective function.
 Convex optimization
 If J(θ) is a convex function, the solution can be obtained by solving ∇_θ J(θ) = 0 (as there is only one minimum (maximum))
 Sometimes, however, one cannot get a closed-form solution.
 What can we do, then?
2019-09-26 95Machine learning and artificial neural network
Iterative search for minimum/maximum
 One idea: gradient search
 Gradient descent
 Hill climbing
 Steps
 Given a cost function J(θ)
 Initialize n = 0, θ⁽⁰⁾ = 0
 Loop (epoch):
1. Compute the gradient at the current position, g = ∇_θ J(θ)|_{θ=θ⁽ⁿ⁾}
2. Update the parameters, θ⁽ⁿ⁺¹⁾ = θ⁽ⁿ⁾ − η g
3. n ← n + 1
4. Repeat 1–3 until convergence
2019-09-26 96Machine learning and artificial neural network
 η: learning rate, 0 < η ≪ 1
 A small enough η ensures that J(θ⁽ⁿ⁺¹⁾) ≤ J(θ⁽ⁿ⁾)
 Large η: fast convergence, but high MSE due to bouncing
 Small η: slow convergence, but lower MSE
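A minimal sketch of the gradient-descent loop above; the quadratic example cost and the learning rate are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_J, theta0, eta=0.01, n_epochs=1000):
    """Plain gradient descent: theta <- theta - eta * grad J(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_epochs):
        theta = theta - eta * grad_J(theta)
    return theta

# Example: minimize J(theta) = (theta1 - 3)^2 + 2*(theta2 + 1)^2
grad_J = lambda th: np.array([2 * (th[0] - 3), 4 * (th[1] + 1)])
print(gradient_descent(grad_J, [0.0, 0.0], eta=0.05, n_epochs=500))   # ~ [3, -1]
```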
Iterative search for minimum/maximum
 Stochastic gradient descent (SGD)
 The cost is typically a sum of per-sample costs: J(θ) = Σ_i J_i(θ)
 Update for every sample
 Steps
 Initialize θ = 0
 Outer loop (epoch): for n = 1, 2, …
• Inner loop: for i = 1, 2, …, N (number of samples)
  θ ← θ − η ∇_θ J_i(θ)
• Repeat the inner loop until convergence
2019-09-26 97Machine learning and artificial neural network
Iterative search for minimum/maximum
 In linear regression
 J(θ) = Σ_i e_i², with e_i = y_i − x̃_iᵀθ
 ∇_θ J_i(θ) = −2 e_i x̃_i (gradient of the per-sample cost)
 SGD for linear regression
 Initialize θ = 0, n = 0
 Outer loop (epoch): for n = 1, 2, …
• Inner loop: for i = 1, 2, …, N (number of samples)
  e_i⁽ⁿ⁾ = y_i − x̃_iᵀθ⁽ⁿ⁾
  θ⁽ⁿ⁾ ← θ⁽ⁿ⁾ + η e_i⁽ⁿ⁾ x̃_i
• Repeat the inner loop until convergence
2019-09-26 98Machine learning and artificial neural network
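A minimal sketch of the per-sample SGD rule above (rows of X are 1-augmented samples here); the toy data and learning rate are assumed for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, n_epochs=50):
    """Per-sample SGD for linear regression (rows of X are 1-augmented samples)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):                 # outer loop: epochs
        for x_i, y_i in zip(X, y):            # inner loop: one update per sample
            e_i = y_i - x_i @ theta           # per-sample error
            theta = theta + eta * e_i * x_i   # gradient step on the per-sample cost
    return theta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=100)
print(sgd_linear_regression(X, y))            # close to [1, 2, -1]
```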
Iterative search for minimum/maximum
 Using momentum
 In SGD, if each sample contains “noise”, it disturbs the algorithm, i.e., the parameters may move in an incorrect direction
 This can be alleviated using momentum:
  v⁽ⁿ⁾ = γ v⁽ⁿ⁻¹⁾ − η ∇_θ J_i(θ⁽ⁿ⁾)
  θ⁽ⁿ⁺¹⁾ = θ⁽ⁿ⁾ + v⁽ⁿ⁾
 where 0 < γ < 1
2019-09-26 99Machine learning and artificial neural network
Iterative search for minimum/maximum
 Coordinate descent
 Rather than updating every parameter at once,
 update the parameters one by one (one coordinate at a time)
  θ_k⁽ⁿ⁺¹⁾ = argmin_{θ_k} J(θ₁⁽ⁿ⁺¹⁾, …, θ_{k−1}⁽ⁿ⁺¹⁾, θ_k, θ_{k+1}⁽ⁿ⁾, …, θ_M⁽ⁿ⁾)
 θ_k⁽ⁿ⁺¹⁾ is given by the solution of the equation
  ∂J(θ)/∂θ_k = 0, evaluated at [θ₁⁽ⁿ⁺¹⁾, …, θ_{k−1}⁽ⁿ⁺¹⁾, θ_k, θ_{k+1}⁽ⁿ⁾, …, θ_M⁽ⁿ⁾]
 Simpler implementation:
  θ_k⁽ⁿ⁺¹⁾ = argmin_{θ_k} J(θ₁⁽ⁿ⁾, …, θ_{k−1}⁽ⁿ⁾, θ_k, θ_{k+1}⁽ⁿ⁾, …, θ_M⁽ⁿ⁾)
2019-09-26 100Machine learning and artificial neural network
Iterative search for minimum/maximum
 Coordinate descent for linear regression
 Cost: J(θ) = ‖y − X̃ᵀθ‖²
 Setting ∂J/∂θ_k = 0 with all other θ_j (j ≠ k) fixed gives a closed-form per-coordinate update
 Update rule: θ_k⁽ⁿ⁺¹⁾ = x̃⁽ᵏ⁾ᵀ(y − Σ_{j≠k} θ_j x̃⁽ʲ⁾) / ‖x̃⁽ᵏ⁾‖², where x̃⁽ᵏ⁾ is the k-th row of X̃ (taken as a column vector)
 Homework: C.1
2019-09-26 101Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.6: Classification
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.6 Classification: problem formulation
 Topics
1. Bayesian approach
2. Bayesian approach under Gaussian assumption
 Decision boundary
3. Linear model as a special case
2019-09-26 103Machine learning and artificial neural network
Classification: Problem formulation
 Data
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ C
 where C = {c₁, …, c_K} is a set of categories (classes)
 The y_i's are categorical and discrete
 Bayesian approach: probabilistic model
 Assume each class (the k-th class) is distributed as ~ p(x|H_k).
 Given new data x, decide its class y as
  ŷ = argmax_{k ∈ {1,…,K}} p(x|H_k)
 i.e., select the class index for which the conditional probability of x is maximum
2019-09-26 104Machine learning and artificial neural network
Classification: Bayesian approach
 Binary classification
 Assume binary classification (for simplicity), i.e., y ∈ {0, 1}
 Given new data x, decide its class y by comparing the log-likelihoods log p(x|H₀) and log p(x|H₁)
 Binary classification under Gaussian assumption
 Assume p(x|H_k) = N(x; μ_k, C_k) with parameters μ_k and C_k.
 Then, we have
  log p(x|H₁) − log p(x|H₀) = ½[(x − μ₀)ᵀC₀⁻¹(x − μ₀) − (x − μ₁)ᵀC₁⁻¹(x − μ₁)] + ½ log(|C₀|/|C₁|) (decide class 1 if this is > 0)
2019-09-26 105Machine learning and artificial neural network
Classification: Bayesian approach
 Binary classification under Gaussian assumption
 Suppose that C₀ = C₁ = C. Then, we have
  compare (x − μ₁)ᵀC⁻¹(x − μ₁) with (x − μ₀)ᵀC⁻¹(x − μ₀) (decide class 1 if the former is smaller)
 i.e., compare the (Mahalanobis) distances of x from the class centers
2019-09-26 106Machine learning and artificial neural network
𝑝 𝒙|𝐻 𝑝 𝒙|𝐻
Classification: Decision boundary
 Decision boundary
 It is a “surface” where p(x|H₀) = p(x|H₁), i.e.,
  (x − μ₀)ᵀC₀⁻¹(x − μ₀) − (x − μ₁)ᵀC₁⁻¹(x − μ₁) + log(|C₀|/|C₁|) = 0
 It can be written as
  xᵀ(C₀⁻¹ − C₁⁻¹)x + bᵀx + c = 0
 where
  b = 2(C₁⁻¹μ₁ − C₀⁻¹μ₀) (a vector)
  c = μ₀ᵀC₀⁻¹μ₀ − μ₁ᵀC₁⁻¹μ₁ + log(|C₀|/|C₁|) (a scalar)
 The decision boundary is given by a “conic section”,
 which can be a hyperbola, an ellipse or a (hyper)plane
2019-09-26 107Machine learning and artificial neural network
Classification: Linear model
 Linear model for binary classification
 Suppose further that C₀ = C₁ = C.
 Then, the decision boundary becomes
  θᵀx + θ₀ = 0, with θ = 2C⁻¹(μ₁ − μ₀) and θ₀ = μ₀ᵀC⁻¹μ₀ − μ₁ᵀC⁻¹μ₁
 which is a (hyper)plane
 And the decision rule becomes
  decide class 1 if θᵀx + θ₀ > 0, or equivalently, ŷ = u(θᵀx + θ₀) (u: unit step)
 Model parameters: θ and θ₀ (intercept)
 A linear classifier partitions ( ? ) into
non-overlapping areas using ( ? )
2019-09-26 108Machine learning and artificial neural network
Classification: Linear model vs. Bayesian approach
 Bayesian classifier versus linear classifier
2019-09-26 109Machine learning and artificial neural network
[Figure: class-conditional densities p(x|H₀) and p(x|H₁) with the Bayesian decision boundary vs. the linear boundary θᵀx + θ₀ = 0]
Classification: Summary
 Binary classification: summary
 Bayesian approach: ŷ = argmax_k p(x|H_k)
 Under Gaussian assumption (with C₀ = C₁ = C):
  compare (x − μ₀)ᵀC⁻¹(x − μ₀) and (x − μ₁)ᵀC⁻¹(x − μ₁)
 With this, we get the linear model
  decide class 1 if θᵀx + θ₀ > 0, or equivalently, ŷ = u(θᵀx + θ₀)
2019-09-26 110Machine learning and artificial neural network
Our main focus is
on this linear model
Classification: Naive implementation
 Naive implementation
 Given data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ C
 C is a set of categories (classes)
 The y_i's are categorical and discrete, e.g., y_i ∈ {0, 1}
 Divide the data into X₀ and X₁ (one subset for each class)
 Compute (μ_k, C_k) for k = 0, 1
 Use the p(x|H_k)'s for classification
 This is not our focus, though.
2019-09-26 111Machine learning and artificial neural network
Classification: Roadmap
 Based on the model ŷ = u(θᵀx + θ₀),
 Ch.7: We will develop a training (learning) rule, where we obtain θ and θ₀ directly from data by solving an optimization problem
 Ch.8: The linear model will be extended to the multinomial classification problem
 Ch.9: The model will be further extended to obtain the neural network model
2019-09-26 112Machine learning and artificial neural network
Homework & Computer Lab.
 Homework: 5.1, 5.2, 5.3
2019-09-26 113Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.7: Logistic Regression
(binary classification)
Seokhyun Yoon, Electronics Eng., Dankook University
Roadmap
2019-09-26 115Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
Ch.7 Logistic Regression for binary classification
 Topics
1. Logistic regression:
 Model with logistic sigmoid function
2. Parameter optimization:
 Likelihood function as an objective function
 Application of gradient search algorithm
3. Performance measures of binary classifier
 Confusion matrix, True Positive and False negative
 Accuracy, Sensitivity, Specificity
 ROC and AUC
2019-09-26 116Machine learning and artificial neural network
Logistic regression: Model
 Recall the (generalized) linear model for binary classification, ŷ = f(θᵀx + θ₀)
 It is a linear regressor if f(z) = z
 It is a linear classifier if f(z) = u(z) (unit step)
 It is a logistic regressor if f(z) = σ(z) = 1/(1 + e^(−z))
2019-09-26 117Machine learning and artificial neural network
Logistic regression: Model
 Interpretation of logistic regression model
 ŷ = σ(θᵀx + θ₀), where σ(z) = 1/(1 + e^(−z))
 ŷ can be regarded as Pr{y = 1 | x}, so that
  Pr{y = 1 | x} = 1/(1 + e^(−(θᵀx+θ₀))) = e^(θᵀx+θ₀)/(1 + e^(θᵀx+θ₀))
  Pr{y = 0 | x} = 1 − ŷ = 1/(1 + e^(θᵀx+θ₀))
 ŷ can also be interpreted as a “class estimate”.
 In both cases, if θᵀx + θ₀ > 0, x is likely to be class 1; otherwise class 0.
 e^(θᵀx+θ₀) is called the “odds” of being class 1. (Note: e^(θᵀx+θ₀) = Pr{y=1|x}/Pr{y=0|x}.)
2019-09-26 118Machine learning and artificial neural network
Logistic regression
 Geometrical
interpretation
2019-09-26 119Machine learning and artificial neural network
[Figure: decision boundary θᵀx + θ₀ = 0 separating class 1 from class 0; the decision variable z = θᵀx + θ₀ determines the odds of x belonging to class 1]
Logistic regression: Cost function
 Cost function: Negative log-likelihood
 ŷ_i = σ(θᵀx_i + θ₀) can be interpreted as the probability (likelihood) that x_i belongs to class 1.
 The likelihood that x_i belongs to its target class y_i is given by p_i = ŷ_i^{y_i}(1 − ŷ_i)^{1−y_i}
 Log-likelihood as an “objective” to maximize: L(θ) = Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]
 Can also be formulated as the minimization of −L(θ)
2019-09-26 120Machine learning and artificial neural network
Logistic regression
 Elements of regression/classification problem
 Data (observations x and their target values y)
 Prediction model f(x; θ) with parameters θ
 Cost (loss)/objective function to minimize/maximize
 Algorithm to efficiently obtain the optimal or a good solution
2019-09-26 121Machine learning and artificial neural network
Data: {(x_i, y_i)}, i = 1, …, N
Model with parameters: ŷ = σ(θᵀx̃), σ(z) = 1/(1 + e^(−z))
Cost/loss: negative log-likelihood −L(θ)
Algorithm to min/maximize: gradient descent
Logistic regression: Optimization
 Optimization
 L(θ) contains the non-linear function σ(z) = 1/(1 + e^(−z)).
 → argmax_θ L(θ) isn't a simple QP problem.
 We resort to gradient search to get the optimal (or a good) solution.
 To perform gradient search, we need the gradient of the cost, which is given by (see textbook p.68)
  ∇_θ L(θ) = Σ_i (y_i − ŷ_i) x̃_i
 Algorithm (pseudo code)
 Initialize θ⁽⁰⁾
 θ⁽ⁿ⁺¹⁾ = θ⁽ⁿ⁾ + η ∇_θ L(θ⁽ⁿ⁾) for n = 0, 1, 2, …
2019-09-26 122Machine learning and artificial neural network
“+” means hill-climbing
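A minimal NumPy sketch of the hill-climbing rule above (using the full-batch gradient of the log-likelihood rather than per-sample updates, for brevity); the toy data and learning rate are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_fit(X, y, eta=0.01, n_epochs=500):
    """Gradient ascent on the log-likelihood; rows of X are 1-augmented samples, y in {0,1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        p = sigmoid(X @ theta)                 # predicted class-1 probabilities
        theta = theta + eta * X.T @ (y - p)    # "+": hill-climbing on the likelihood
    return theta

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X @ np.array([-0.5, 2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
theta = logistic_regression_fit(X, y)
print(np.mean((sigmoid(X @ theta) > 0.5) == y))   # training accuracy
```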
Logistic regression: Another cost function
 Another cost: residual sum of squares (RSS)
 ŷ_i = σ(θᵀx_i + θ₀) can also be interpreted as a class estimate.
 Define the estimation error: e_i = y_i − ŷ_i
 RSS as a cost to minimize: J(θ) = Σ_i e_i²
 Gradient (see textbook p.68)
  ∇_θ J(θ) = −2 Σ_i e_i ŷ_i(1 − ŷ_i) x̃_i
 Gradient descent
  θ⁽ⁿ⁺¹⁾ = θ⁽ⁿ⁾ − η ∇_θ J(θ⁽ⁿ⁾) for n = 0, 1, 2, …
 What's the difference from the likelihood-based optimization?
2019-09-26 123Machine learning and artificial neural network
“-” means gradient descent
Performance measures of binary classifier
 Confusion matrix
 The confusion matrix counts TP (true positives), FN (false negatives), FP (false positives) and TN (true negatives)
 Accuracy = (TP + TN)/(TP + TN + FP + FN)
 Sensitivity (TPR) = TP/(TP + FN)
 Specificity = TN/(TN + FP) = 1 − FPR, where FPR = FP/(FP + TN)
2019-09-26 124Machine learning and artificial neural network
 Why do we need other
measures than accuracy?
 In some application, FN (FP)
causes more serious problem
than FP (FN)
 E.g., in medical application, you
want to make decision if a
person has tumor (P) or not (N).
It isn’t a big problem if a normal
person (without tumor) is
decided to have tumor (FP). But,
the opposite case (a person with
tumor decided as normal, FN)
may cause serious problem.
 You may want to minimize FPR
requiring TPR no less than a
certain threshold.
Performance measures of binary classifier
 ROC and AUC
 ROC: Receiver operating characteristic
 AUC: Area under (the ROC) curve
2019-09-26 125Machine learning and artificial neural network
[Figure: ROC curve — TPR = TP/(TP+FN) plotted against FPR = FP/(FP+TN), both ranging from 0 to 1; the AUC is the area under the curve; the curve traces the performance as the decision boundary (threshold) is moved: shifting it toward the positive side makes TP and FP go down (TN and FN go up), shifting it the other way makes TP and FP go up (TN and FN go down)]
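A minimal sketch computing accuracy, sensitivity and specificity from the confusion-matrix counts; the label vectors are assumed for illustration. Sweeping the decision threshold of a classifier and recomputing (FPR, TPR) at each threshold traces out the ROC curve.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TPR) and specificity from the 2x2 confusion-matrix counts."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # TPR = TP / (TP + FN)
    specificity = tn / (tn + fp)      # 1 - FPR, FPR = FP / (FP + TN)
    return accuracy, sensitivity, specificity

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])
print(binary_metrics(y_true, y_pred))
```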
Homework & Computer Lab.
 Homework: 7.1, 7.2
 Practice: ML_practice2_classification_ex_190820.ipynb
2019-09-26 126Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.8: Multi-task regression
and multinomial classification
Seokhyun Yoon, Electronics Eng., Dankook University
Roadmap
2019-09-26 128Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
Ch.8 Multiclass classification
 Topics
1. Multi-task regression
2. Multinomial classification
3. Generalized linear model
2019-09-26 129Machine learning and artificial neural network
Multi-task linear regression
 Linear regression with a vector target
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ ℝᴷ
 where Y: K×N matrix with each column being y_i
 Linear model: ŷ = Θ x̃
 where Θ: K×(M+1) matrix (including the intercepts)
 Define
  θ⁽ᵏ⁾: the k-th row of Θ (θ_k: the k-th column of Θ)
  y⁽ᵏ⁾: the k-th row of Y
 Cost function (RSS)
  J(Θ) = Σ_i ‖y_i − Θ x̃_i‖² = Σ_k J_k(θ⁽ᵏ⁾), where J_k(θ⁽ᵏ⁾) = Σ_i (y_i⁽ᵏ⁾ − θ⁽ᵏ⁾ᵀx̃_i)²
2019-09-26 130Machine learning and artificial neural network
Multi-task linear regression
 Linear regression with a vector target
 The cost function is a sum of the RSS for each target value (task): J(Θ) = Σ_k J_k(θ⁽ᵏ⁾)
 Optimization can be performed separately for each target value, i.e.,
  min_Θ J(Θ) ⇔ min_{θ⁽ᵏ⁾} J_k(θ⁽ᵏ⁾) for each k
 where ∇_{θ⁽ᵏ⁾} J_k = 0 gives θ⁽ᵏ⁾ = (X̃ X̃ᵀ)⁻¹ X̃ y⁽ᵏ⁾
 And ∇_Θ J = 0 gives Θᵀ = (X̃ X̃ᵀ)⁻¹ X̃ Yᵀ
 Can be implemented using K parallel linear regressors with a scalar target value
2019-09-26 131Machine learning and artificial neural network
Multi-task linear regression
 Linear regression with a vector target
 Can be implemented using K parallel linear regressors with a scalar target value
 Alternative expression of the cost function: J(Θ) = ‖Y − ΘX̃‖_F² = tr[(Y − ΘX̃)(Y − ΘX̃)ᵀ]
2019-09-26 132Machine learning and artificial neural network
Multinomial classification: two approaches
 Multinomial classification can be implemented using multiple binary classifiers.
 Two approaches (K-class case)
 One against the rest:
 We use K binary classifiers, one for each class.
 Each classifier (the k-th classifier) computes, for example, the likelihood p̂_k = Pr{y = k | x} of input x belonging to the k-th class.
 Decide the class having the highest likelihood
 Pairwise binary classification + majority voting:
 We use K(K−1)/2 binary classifiers, one for each pair of classes.
 Decide the class by taking a majority vote over the winners.
2019-09-26 133Machine learning and artificial neural network
Multinomial logistic regression
 Data
 Data: {(x_i, y_i)}, i = 1, …, N, where x_i ∈ ℝᴹ, y_i ∈ C
 where C = {0, 1, …, K−1} is a set of categories (classes)
 The y_i's are categorical and discrete
 Considerations
 A (single-task) logistic regressor using the (integer) y as its target value will not work well (because the y's are categorical, while a single-task regressor regards the y's as numerical.)
 One approach is to encode the y's into binary vectors (of size K×1) and use a multi-task logistic regressor
2019-09-26 134Machine learning and artificial neural network
Multinomial logistic regression
 Model
 Softmax function on top of a multi-task linear regressor
 Multi-task linear regressor
  o_k = θ⁽ᵏ⁾ᵀ x̃ for k = 1, …, K
  (odds of x belonging to class k)
  Or, collectively, o = Θ x̃
 Softmax function
  p̂_k = S_k(o) = e^{o_k} / Σ_j e^{o_j}
  (likelihood of x belonging to class k)
 Note that 0 ≤ p̂_k ≤ 1 and Σ_k p̂_k = 1
2019-09-26 135Machine learning and artificial neural network
Multinomial logistic regression
 Cost/objective
 p̂_k can be interpreted as Pr{x belongs to class k}
 The log-likelihood L(Θ) = Σ_i log p̂_{y_i}(x_i) can be used as the objective to maximize.
 Gradient:
  ∇_{θ⁽ᵏ⁾} L = Σ_i (t_{k,i} − p̂_k(x_i)) x̃_i, where t_{k,i} = 1 if y_i = k and 0 otherwise
 Gradient search:
  Θ⁽ⁿ⁺¹⁾ = Θ⁽ⁿ⁾ + η ∇_Θ L(Θ⁽ⁿ⁾) for n = 0, 1, 2, …
2019-09-26 136Machine learning and artificial neural network
Since 0 ≤ 𝑆 (𝜣 𝒙 ) ≤ 1,
the direction of gradient is
either 𝒙 for 𝑘 = 𝑦 or −𝒙 for 𝑘 ≠ 𝑦
Multinomial logistic regression: more issues
 One hot encoding
 One hot encoding is a mapping of an integer y ∈ {0, 1, …, K−1} to a binary vector t = [t₀, t₁, …, t_{K−1}]ᵀ such that t_k = 1 if k = y and t_k = 0 otherwise, i.e., only one element of t is 1 and all others are 0.
 Example: with K = 4, y = 2 is encoded as t = [0, 0, 1, 0]ᵀ
 By encoding all the target values y₁, y₂, …, y_N into t₁, t₂, …, t_N, we have T = [t₁, t₂, …, t_N]
 T is a K×N matrix with each column being t_i
 Then, the gradient is given by
  ∇_Θ L = Σ_i (t_i − p̂_i) x̃_iᵀ
2019-09-26 137Machine learning and artificial neural network
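A minimal NumPy sketch of multinomial logistic regression with one-hot targets and the gradient Σ_i (t_i − p̂_i) x̃_iᵀ above; the toy data, K = 3, and the learning rate are assumptions for illustration.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max(axis=0, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def one_hot(y, K):
    T = np.zeros((K, len(y)))
    T[y, np.arange(len(y))] = 1.0
    return T

def multinomial_fit(X, y, K, eta=0.01, n_epochs=300):
    """Gradient ascent on the log-likelihood of a softmax (multinomial) regressor.
    X is (M+1) x N with 1-augmented samples as columns; y holds integer labels 0..K-1."""
    Theta = np.zeros((K, X.shape[0]))
    T = one_hot(y, K)
    for _ in range(n_epochs):
        P = softmax(Theta @ X)                     # K x N class likelihoods
        Theta = Theta + eta * (T - P) @ X.T        # gradient: sum_i (t_i - p_i) x_i^T
    return Theta

rng = np.random.default_rng(5)
X = np.vstack([np.ones(150), rng.normal(size=(2, 150))])
W_true = rng.normal(size=(3, 3))
y = np.argmax(W_true @ X, axis=0)                  # assumed labels from a linear rule
Theta = multinomial_fit(X, y, K=3)
print(softmax(Theta @ X).argmax(axis=0)[:10])      # predicted classes
```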
Multinomial logistic regression : more issues
 Cross-entropy
 With one hot encoding: t₁, t₂, …, t_N, where t_i = [t_{0,i}, …, t_{K−1,i}]ᵀ
 t_i is the probability mass of y_i
 Posterior likelihood of x_i: p̂_i = [p̂_{0,i}, …, p̂_{K−1,i}]ᵀ with p̂_{k,i} = S_k(Θ x̃_i)
 The cross entropy between t_i and p̂_i is given by
  H(t_i, p̂_i) = −Σ_k t_{k,i} log p̂_{k,i}
 We call J(Θ) = Σ_i H(t_i, p̂_i) the “cross-entropy cost”.
2019-09-26 138Machine learning and artificial neural network
Multinomial logistic regression : more issues
 Multi-task logistic regressor
 Using one hot encoding, one can
replace (for simplicity) the softmax
function with K separate logistic
sigmoid function
 K parallel logistic regressors.
 Performance ?
2019-09-26 139Machine learning and artificial neural network
[Figure: K parallel logistic regressors — inputs x0 … xM, linear outputs o1 … oK, separate sigmoid activations s(o_k) producing p̂1 … p̂K]
 Other remarks
 Multinomial logistic regression is a one-against-the-rest approach.
 Once the likelihoods p̂_k are obtained, the class estimate is determined by ŷ = argmax_k p̂_k
Multinomial logistic regression: generalization
 Generalized linear model
 Linear regression and logistic
regressions can be represented by
one structure
 Consisting of an “activation
function” on top of multi-
task linear regressor
 The output can be interpreted
in various ways (e.g., as likelihoods
or as estimates of target value)
2019-09-26 140Machine learning and artificial neural network
 Also, there are many options for activation function (e.g.,
linear, sigmoid or tanh)
 If input is categorical, apply one hot encoding before
feed to regressor (input dimension must be changed too)
Multinomial logistic regression: generalization
 Generalized linear model
 Regularization can also be applied, if desired, by defining the cost with a penalty
  J(Θ) = J₀(Θ) + λ‖Θ‖_F²
 where J₀ is
  for linear regression: the RSS
  for logistic regression: the negative log-likelihood (cross-entropy)
 The model basically regards the input and output as numerical. So, if you deal with categorical values, you need to apply one hot encoding first.
2019-09-26 141Machine learning and artificial neural network
Homework & Computer Lab.
 Practice: ML_practice2_classification_ex_190820.ipynb
2019-09-26 142Machine learning and artificial neural network
Machine Learning and Neural Network
Ch.9: Artificial neural network
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.9 Artificial neural network
 Topics
1. Perceptron and artificial neural network (NN)
2. Neural network model
3. Training NN: backpropagation
4. Some issues on NN
 Convergence to local minima
 Overfitting
 Vanishing gradient problem
5. Practical considerations (building and training NN)
2019-09-26 144Machine learning and artificial neural network
Roadmap
2019-09-26 145Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
ANN: Perceptron
 Perceptron
 It is an array of interconnected neurons, exactly the same as in generalized linear models
 It was originally proposed to mimic the biological neuron
2019-09-26 146Machine learning and artificial neural network
[Figure: a biological neuron (source: https://en.wikipedia.org/wiki/Biological_neuron_model) compared with the regression model (artificial neuron) — input nodes (dendrites) x0 … xM, (synaptic) weights θ0 … θM, net = θᵀx, activation function f(net), output node (axon terminal) y]
ANN: Perceptron
 Perceptron
 The multi-task regression model is a horizontal array of artificial neurons, with either combined activation or separate activation
2019-09-26 147Machine learning and artificial neural network
[Figure: a layer of K artificial neurons with inputs x0 … xM and linear outputs o1 … oK — left: combined activation ŷ = f(o1, o2, …, oK); right: separate activations p̂_k = s(o_k)]
ANN: Multi-layer Perceptron
 Multi-layer Perceptron
 Consists of multiple layers of
multi-task regressors vertically
stacked
 Output of one layer is fed to the
input of the next layer.
 Number of layers and number of
neurons per layer can be
arbitrarily set
 The non-linear activation function makes it different from the single-layer (linear) model, i.e., it makes the model non-linear
 Can be used for regression and
classification
2019-09-26 148Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Operations
 Feedforward (prediction phase): For a given input x and the current parameters W, it produces an output ŷ
 Feedback (training phase): For each input x and target vector y, the parameters W⁽ˡ⁾ are updated
 Gradient search is used for some optimality criterion
2019-09-26 149Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Structure definition
 Number of layers: L
 Number of neurons per layer: K₁, K₂, …, K_L
 Full connection assumed
 Signals and parameters
 Input: x = z⁽⁰⁾
 Target vector: y
 Weight matrices: W⁽ˡ⁾, l = 1, …, L
 Hidden layer outputs: z⁽ˡ⁾, l = 1, …, L−1
 Final output: ŷ = z⁽ᴸ⁾
2019-09-26 150Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Feedforward (prediction)
From l = 1 to L:
1) a⁽ˡ⁾ = W⁽ˡ⁾ z̃⁽ˡ⁻¹⁾
2) z⁽ˡ⁾ = f(a⁽ˡ⁾)
 More simply, z⁽ˡ⁾ = f(W⁽ˡ⁾ z̃⁽ˡ⁻¹⁾)
 z̃⁽ˡ⁻¹⁾ is the 1-augmented version of z⁽ˡ⁻¹⁾
 W⁽ˡ⁾ is a K_l × (K_{l−1} + 1) matrix including the “intercept”
 The activation function is applied to each element of a⁽ˡ⁾
2019-09-26 151Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Feedback (training)
 Assume training is performed on a per-sample basis, i.e., SGD
 Cost function (RSS): J(W⁽¹⁾, W⁽²⁾, …, W⁽ᴸ⁾) = ‖y − z⁽ᴸ⁾‖²
 Cross-entropy can also be used as the cost (not covered here)
 To train the model, we need ∇_{W⁽ˡ⁾} J for l = 1, …, L
 The top layer is easy: ∇_{W⁽ᴸ⁾} J = −2 δ⁽ᴸ⁾ z̃⁽ᴸ⁻¹⁾ᵀ,
 where δ⁽ᴸ⁾ = f′(a⁽ᴸ⁾) ⊙ e⁽ᴸ⁾ and e⁽ᴸ⁾ = y − z⁽ᴸ⁾
 The layers below? We need to apply the chain rule
 The problem, however, is not as simple as you might expect.
See textbook, section 9.3
2019-09-26 152Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Feedback (training)
 The training starts from the top layer and runs downward, one layer at a time.
 Training: from l = L down to 1:
  W⁽ˡ⁾ ← W⁽ˡ⁾ + η ΔW⁽ˡ⁾, with ΔW⁽ˡ⁾ ∝ −∇_{W⁽ˡ⁾} J(W⁽¹⁾, W⁽²⁾, …, W⁽ᴸ⁾)
 where, by applying the chain rule (see textbook p.81-82),
  ΔW⁽ˡ⁾ = δ⁽ˡ⁾ z̃⁽ˡ⁻¹⁾ᵀ, with δ⁽ˡ⁾ = f′(a⁽ˡ⁾) ⊙ (W⁽ˡ⁺¹⁾ᵀ δ⁽ˡ⁺¹⁾)
 We call it “backpropagation (BP)” as it is performed backward (downward), opposite to the feedforward operation.
2019-09-26 153Machine learning and artificial neural network
ANN: Multi-layer Perceptron
 Back-propagation (BP) algorithm
 From l = L down to 1: W⁽ˡ⁾ ← W⁽ˡ⁾ + η ΔW⁽ˡ⁾, where (element-wise)
  Δw_{kj}⁽ˡ⁾ = δ_k⁽ˡ⁾ z_j⁽ˡ⁻¹⁾
  δ_k⁽ᴸ⁾ = f′(a_k⁽ᴸ⁾)(y_k − z_k⁽ᴸ⁾)
  δ_k⁽ˡ⁾ = f′(a_k⁽ˡ⁾) Σ_j w_{jk}⁽ˡ⁺¹⁾ δ_j⁽ˡ⁺¹⁾
2019-09-26 154Machine learning and artificial neural network
Vector-matrix form
 ΔW⁽ˡ⁾ = δ⁽ˡ⁾ z̃⁽ˡ⁻¹⁾ᵀ
 δ⁽ᴸ⁾ = f′(a⁽ᴸ⁾) ⊙ (y − z⁽ᴸ⁾)
 δ⁽ˡ⁾ = f′(a⁽ˡ⁾) ⊙ (W⁽ˡ⁺¹⁾ᵀ δ⁽ˡ⁺¹⁾)
 ⊙: element-wise product
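A minimal per-sample feedforward + BP sketch of the rules above, with logistic-sigmoid activations and the RSS cost; the tiny 2-3-1 network, the learning rate, and the toy sample are assumptions. (When back-propagating through W⁽ˡ⁺¹⁾, the bias column is dropped, since δ⁽ˡ⁾ corresponds to the non-bias units.)

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_train_step(W, x, y, eta=0.1):
    """One per-sample feedforward + backpropagation step for a fully connected MLP
    with logistic-sigmoid activations and the RSS cost (a sketch of the BP rule above)."""
    # Feedforward: keep the 1-augmented layer outputs z~ and the activations a
    zs, a_list = [np.append(1.0, x)], []
    for Wl in W:
        a = Wl @ zs[-1]
        a_list.append(a)
        zs.append(np.append(1.0, sigmoid(a)))
    z_out = zs[-1][1:]                                         # network output z^(L)

    # Backpropagation: delta^(L) = f'(a^(L)) * (y - z^(L)), then push deltas downward
    deltas = [None] * len(W)
    deltas[-1] = sigmoid(a_list[-1]) * (1 - sigmoid(a_list[-1])) * (y - z_out)
    for l in range(len(W) - 2, -1, -1):
        fp = sigmoid(a_list[l]) * (1 - sigmoid(a_list[l]))
        deltas[l] = fp * (W[l + 1][:, 1:].T @ deltas[l + 1])   # drop the bias column

    # Parameter update: W^(l) <- W^(l) + eta * delta^(l) z~^(l-1)^T
    for l in range(len(W)):
        W[l] += eta * np.outer(deltas[l], zs[l])
    return z_out

# Tiny 2-3-1 network trained on one assumed toy sample
rng = np.random.default_rng(6)
W = [rng.normal(scale=0.5, size=(3, 3)), rng.normal(scale=0.5, size=(1, 4))]
for _ in range(1000):
    mlp_train_step(W, np.array([0.5, -0.2]), np.array([1.0]))
print(mlp_train_step(W, np.array([0.5, -0.2]), np.array([1.0])))   # output approaches 1
```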
ANN: Multi-layer Perceptron
 Activation function
 Except for the top (output) layer, the activation function should be non-linear for a hidden layer to be effective.
 Any monotonically increasing function can be used.
 They are typically S-shaped, e.g., the logistic sigmoid or tanh
 ReLU and leaky ReLU have been widely used recently.
  ReLU: f(a) = max(0, a)
  Leaky ReLU: f(a) = max(αa, a) with 0 < α < 1
2019-09-26 155Machine learning and artificial neural network
ANN Issues: Convergence to local minima
 Convergence to local minima
 NN is a non-linear model and the cost J is not convex.
 The number of minima/maxima is not known
 Gradient search does not guarantee the convergence to
the global minimum
 The local minimum we get depends on the initial setting of W
 There is no systematic approach yet to achieve the global minimum
 Simulated annealing and genetic algorithms have been proposed as heuristic solutions
2019-09-26 156Machine learning and artificial neural network
ANN Issues: Overfitting
 Overfitting
 NN model has so many parameters (W(1),W(2),…,W(L))
 Deep NN is especially the case
 Similar to the linear model with N ≪ M, an NN with too many parameters may easily overfit the training data
 Three approaches to relieve overfitting
 Noise injection: increase the number of data by adding noise → reduces the generalization error (to some extent)
 Regularization: add an L1/L2 penalty to the cost function → similar impact to noise injection
 Dropout ?
2019-09-26 157Machine learning and artificial neural network
ANN Issues: Overfitting
 Dropout: avoiding co-adaptation of neurons
 Useful for Convolutional NN (for image)
 At each training phase (for a batch of samples), we
randomly select a portion of neurons (with probability p)
and disable them
 Can avoid many neurons co-adapted to each other (avoid
many neurons activated to similar data)
 Many NN packages support dropout layer as an option
2019-09-26 158Machine learning and artificial neural network
ANN Issues: Vanishing gradient
 Vanishing gradient problem
 This is also a typical problem in deep neural networks.
 BP (training) starts from the top layer and runs downward one-by-one, recursively.
 Recall: ΔW⁽ˡ⁾ = δ⁽ˡ⁾ z̃⁽ˡ⁻¹⁾ᵀ, where δ⁽ˡ⁾ = f′(a⁽ˡ⁾) ⊙ (W⁽ˡ⁺¹⁾ᵀ δ⁽ˡ⁺¹⁾)
 With the sigmoid function, 0 < f′(a) ≤ 1/4 (it is mostly close to 0)
 The δ⁽ˡ⁾'s are computed recursively
 As BP runs downward, δ⁽ˡ⁾ gets smaller and smaller, and so does ΔW⁽ˡ⁾ → vanishing gradient
 If the NN has many layers, the effective learning rate in the bottom layers gets very small, i.e., neurons in the bottom layers are hardly trained → it takes too much time to train them
2019-09-26 159Machine learning and artificial neural network
ANN Issues: Vanishing gradient
 Vanishing gradient problem
 Using ReLU or leaky ReLU may help alleviate vanishing
gradient problem.
 Unsupervised learning based pre-training of bottom layers
was proposed, though not so widely used recently.
2019-09-26 160Machine learning and artificial neural network
ANN Issues: Building NN model
 To build a neural network model, you need to
consider first
 Input and output dimension?
 How many layers? ( )
 How many neurons for each layer? ( )
 Activation function ? (sigmoid, tanh, ReLU or leaky ReLU)
 Dropout layer? With what probability? (p)
 What cost function ? (RSS or cross-entropy)
 Which optimizer to use? (simple SGD w/wo momentum .. )
 Batch size?
 Regression or classification ? (For regression, top layer
activation is typically set linear)
2019-09-26 161Machine learning and artificial neural network
ANN Issues: Training NN model
 When training NN, you need to check
 Overfitting (compare performance with training and test
data while training the model)
 Vanishing gradient (check if training takes too much time)
 Convergence to bad local minima (you can train many times
or train multiple instances in parallel with different initial
values)
2019-09-26 162Machine learning and artificial neural network
Computer Lab.
 Practice: ML_practice3_NN_ex.ipynb
Machine Learning and Neural Network
Ch.10: Recurrent neural network (RNN)
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.10 Recurrent neural network
 Topics
1. Model structure and operation.
2. RNN Training: backpropagation through time (BPTT)
3. LSTM (long/short term memory)
2019-09-26 164Machine learning and artificial neural network
Roadmap
2019-09-26 165Machine learning and artificial neural network
Ch.4 Linear regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
RNN: Recurrent neural network
 Features
 Recurrence means the output is fed
back to the input
 Necessarily, the input is
time-series data
 The example on the right consists of
two layers
 The hidden layer output is fed
back to the input with a one-sample
delay (D)
2019-09-26 166Machine learning and artificial neural network
 Layer 2 has no feedback loop (a conventional NN layer)
 Main applications are speech recognition and language
modelling (machine translation, sentence completion),
where data is given as a time series
h(t) = f( U x(t) + V h(t−1) )   (Layer 1)
ŷ(t) = f( W h(t) )              (Layer 2)
RNN: Recurrent neural network
 Model
 Consider a 1-layer RNN for simplicity
 Input: x(t) (time series)
 Output (state): h(t) (time series)
 Feedforward operation: h(t) = f( U x(t) + V h(t−1) )
 The output depends on both x(t) and the previous output (state) h(t−1)
 The feedforward operation can also be expressed as
h(t) = f( g(t) ), with g(t) = U x(t) + V h(t−1)
 Initial condition: assume h(0) = 0
 (A minimal NumPy sketch of this forward pass follows this slide.)
2019-09-26 167Machine learning and artificial neural network
[Figure (a): RNN with a loop — g(t) = U x(t) + V h(t−1) passes through f(·) to give h(t), which is fed back through a delay (D) as h(t−1)]
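A minimal NumPy sketch of the forward recursion above; tanh is used as the activation f, and the toy sizes are illustrative assumptions.

import numpy as np

def rnn_forward(x_seq, U, V):
    """h(t) = f( U x(t) + V h(t-1) ) with h(0) = 0 and f = tanh (an assumption)."""
    h = np.zeros(U.shape[0])              # initial condition h(0) = 0
    states = []
    for x_t in x_seq:                     # x_seq: (T, input_dim)
        g_t = U @ x_t + V @ h             # g(t) = U x(t) + V h(t-1)
        h = np.tanh(g_t)                  # h(t) = f(g(t))
        states.append(h)
    return np.stack(states)               # (T, hidden_dim)

rng = np.random.default_rng(0)
U, V = 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(4, 4))
print(rnn_forward(rng.normal(size=(5, 3)), U, V).shape)   # (5, 4)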
RNN: Recurrent neural network
 Unfolded model
2019-09-26 168Machine learning and artificial neural network
[Figure: (a) the RNN with a loop and its unfolded form over time steps, with the shared parameters U and V repeated at every step]
RNN: Training
 RNN Training (textbook 10.2)
 Cost function: E = Σ_t E(t), with E(t) = ‖y(t) − h(t)‖², where y(t) is the target vector
 The gradient can be obtained by applying the chain rule.
 Gradient w.r.t. V (at time t): since h(t) = f(g(t)) with g(t) = U x(t) + V h(t−1),
and h(t−1) itself depends on V, the chain rule must run backward through time:
∂E(t)/∂V = Σ_{k=1..t} (∂E(t)/∂g(t)) (∂g(t)/∂g(t−1)) ⋯ (∂g(k+1)/∂g(k)) (∂g(k)/∂V)
 With δ(t) = ∂E(t)/∂g(t) = −2 f′(g(t)) ⊙ (y(t) − h(t)) and ∂g(j)/∂g(j−1) = V diag(f′(g(j−1))),
each term reduces to an outer product of a back-propagated error vector and h(k−1).
2019-09-26 169Machine learning and artificial neural network
RNN: Training
 RNN Training (textbook 10.2)
 Gradient w.r.t. U (at time t): the same chain rule applies, now ending with
∂g(k)/∂U, which involves x(k) instead of h(k−1):
∂E(t)/∂U = Σ_{k=1..t} (∂E(t)/∂g(t)) (∂g(t)/∂g(t−1)) ⋯ (∂g(k+1)/∂g(k)) (∂g(k)/∂U)
 In the same way as for the gradient w.r.t. V, each term reduces to an outer
product of a back-propagated error vector and x(k).
 To update U and V, we need to perform BP through time (from t down to 1).
 We call it backpropagation through time (BPTT).
2019-09-26 170Machine learning and artificial neural network
RNN: Training
 Vanishing and exploding gradient
 Looking at ∂E(t)/∂U (and also ∂E(t)/∂V), the gradient contains products of Jacobians
∂g(t)/∂g(t−1) ⋯ ∂g(k+1)/∂g(k) = ∏_{j=k+1..t} V diag(f′(g(j−1)))
 For any activation function we considered, ‖diag(f′(g))‖ ≤ 1 (matrix norm)
 We therefore have ‖∏_j V diag(f′(g(j−1)))‖ ≤ ‖V‖^(t−k) (mostly < 1, why?)
 As t − k grows, the l.h.s. goes to 0 if ‖V‖ < 1 (vanishing gradient), or
to ∞ if the product of ‖V‖ and the activation slopes exceeds 1 (exploding gradient)
 The latter seldom occurs.
2019-09-26 171Machine learning and artificial neural network
RNN: Training
 Forgets past inputs/outputs quickly
 The same product of Jacobians appears in ∂h(t)/∂x(k), the sensitivity of the
current state to a past input x(k).
 An RNN is supposed to memorize past inputs (in the system
state) to deal with time-series data.
 With ‖V‖ < 1, however, ∂h(t)/∂x(k) → 0 as t − k gets large.
 This means the system forgets past inputs quickly.
 There are many examples where we need long-term memory
to correctly catch what a sentence means.
2019-09-26 172Machine learning and artificial neural network
RNN: Training
 RNN summary
 Due to its recurrent nature, RNN training requires
backpropagation through time (back to t = 1)
 If T gets large, the gradient may vanish or explode  the
training rule should be carefully tuned
 In most cases, vanishing gradient occurs more
frequently than exploding gradient
 One way to avoid the vanishing/exploding gradient problem
is to perform BPTT only over a time window of finite length
(an unfolded model of finite length)
2019-09-26 173Machine learning and artificial neural network
RNN: LSTM
 Long short-term memory (LSTM)
 A variant of the RNN (proposed in 1997) to solve (partly) the
vanishing gradient problem and to make the system memory longer.
 Vanilla RNN vs. LSTM
 3 gates (forget/input/output gate) + main path
 Two separate states: h(t) (short-term) and c(t) (long-term cell state)
2019-09-26 174Machine learning and artificial neural network
RNN: LSTM
 LSTM operation
 Gating functions:
f(t) = σ( U_f x(t) + V_f h(t−1) )   (forget gate)
i(t) = σ( U_i x(t) + V_i h(t−1) )   (input gate)
o(t) = σ( U_o x(t) + V_o h(t−1) )   (output gate)
 Cell state update:
c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ h̃(t)   (long-term memory)
h(t) = o(t) ⊙ tanh( c(t) )            (short-term memory)
where h̃(t) = f( U x(t) + V h(t−1) ) is the vanilla-RNN output
2019-09-26 175Machine learning and artificial neural network
RNN: LSTM
 LSTM operation
 Cell state update:
c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ h̃(t)   (long-term memory)
h(t) = o(t) ⊙ tanh( c(t) )            (short-term memory, final output)
 Ignoring the gating, c(t) is simply the sum of c(t−1) and the new
input h̃(t)  it can keep long-term memory
 f(t) selects important features from the previous state c(t−1),
which comprise one part of the current cell state c(t).
 i(t) selects important features from the new input h̃(t) (the output of
the vanilla RNN), which comprise the other part of the current cell state.
 o(t) controls what features in c(t) to pass to the output h(t).
2019-09-26 176Machine learning and artificial neural network
RNN: LSTM
 LSTM operation
 Gating functions:
f(t) = σ( U_f x(t) + V_f h(t−1) )   (forget gate)
i(t) = σ( U_i x(t) + V_i h(t−1) )   (input gate)
o(t) = σ( U_o x(t) + V_o h(t−1) )   (output gate)
 The parameters of the three gates (U_f, V_f, U_i, V_i, U_o, V_o) are
obtained through BPTT, too.
 i.e., the LSTM learns from the data what features to select
from c(t−1) (long-term memory) and from the new input h̃(t).
 It also learns what features in c(t) to pass to the final
output h(t). (A minimal NumPy sketch of one LSTM step follows this slide.)
2019-09-26 177Machine learning and artificial neural network
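A minimal NumPy sketch of a single LSTM step under the equations above; biases are omitted, and σ/tanh as the gate and candidate activations are stated here as assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p is a dict holding U_f, V_f, U_i, V_i, U_o, V_o, U, V."""
    f_t = sigmoid(p['Uf'] @ x_t + p['Vf'] @ h_prev)   # forget gate
    i_t = sigmoid(p['Ui'] @ x_t + p['Vi'] @ h_prev)   # input gate
    o_t = sigmoid(p['Uo'] @ x_t + p['Vo'] @ h_prev)   # output gate
    g_t = np.tanh(p['U'] @ x_t + p['V'] @ h_prev)     # vanilla-RNN candidate h~(t)
    c_t = f_t * c_prev + i_t * g_t                    # long-term memory c(t)
    h_t = o_t * np.tanh(c_t)                          # short-term memory / output h(t)
    return h_t, c_t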
RNN: Building RNN/LSTM model
 Unfolded RNN/LSTM model
 You can add NN layers on top of the RNN/LSTM cells
(a hedged Keras sketch follows the figure below).
2019-09-26 178Machine learning and artificial neural network
[Figure: unfolded model — RNN/LSTM cells 1, 2, …, K−1, K connected through delays (D), taking inputs x(t−K), …, x(t−1), x(t) and producing outputs y(t−K), …, y(t−1), y(t)]
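A minimal sketch (assuming TensorFlow/Keras is available); the window length K = 20, the single input feature, and the layer widths are illustrative assumptions for a sequence-regression setup.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

K, num_features = 20, 1                       # unfold over K time steps
model = Sequential([
    LSTM(32, input_shape=(K, num_features)),  # RNN/LSTM cell unfolded over the window
    Dense(16, activation='relu'),             # NN layer added on top of the cell
    Dense(1, activation='linear')             # linear top layer for regression
])
model.compile(optimizer='adam', loss='mse')
# model.fit(X, y, batch_size=32, epochs=20)   # X: (num_windows, K, num_features)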
Computer Lab.
 Practice 1: ML_practice4_RNN_seq_pred.ipynb
 Practice 2: ML_practice5_RNN_hihello.ipynb
Machine Learning and Neural Network
Ch.11: Convolutional neural network (CNN)
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.11 Convolutional neural network
 Topics
1. Features of CNN
2. CNN Model
 Convolution sublayer
 Activation function sublayer
 Pooling sublayer
3. CNN Training
2019-09-26 180Machine learning and artificial neural network
Roadmap
2019-09-26 181Machine learning and artificial neural network
Ch.4 Linear Regression → Ch.6 Ridge/Lasso regression → Ch.7 Logistic regression → Ch.8 Multi-task regression → Ch.9 Neural Network → Ch.10 Recurrent NN → Ch.11 Convolutional NN
CNN: Convolutional neural network
 Image/vision classification and object detection
 An image has a 2D (matrix) or 3D (tensor) structure (e.g., RGB)
 Information is contained in the pixels, the elements of a matrix
(2D image) or a tensor (2D images for RGB, or 2D images
captured with 2 cameras).
 Nearby pixel values are highly correlated
 patterns in an image can be identified by the correlations
between nearby pixels
 nearby pixels must be processed as a chunk
 Identifying patterns in an image is “translation invariant”
and “size invariant” (we can identify the same pattern
wherever it is located and whatever its size is).
 Sometimes, it should also be rotation invariant.
2019-09-26 182Machine learning and artificial neural network
CNN: Convolutional neural network
 CNN for image/vision data
 CNN is a special NN designed for image/vision data.
 Can be used for image classification, object detection,
depth estimation, etc.
 It processes a chunk of nearby pixels simultaneously
(receptive field)
 Will see how it provides object (pattern) detection with
translation invariance.
 Size invariance can be provided by multi-layer structure
 Rotation invariance?
2019-09-26 183Machine learning and artificial neural network
CNN Model
 (Example) configuration of a CNN
 Two convolution NN layers and 3 fully connected (FC) NN layers.
 Convolution NN layers are divided into sublayers: a
convolution sublayer (denoted by CX) and a pooling sublayer
(denoted by SX)
 The FC NN layers are C5, F6 and the output (C5 acts like an FC NN layer)
 A hedged Keras sketch of a similar configuration follows the source note below.
2019-09-26 184Machine learning and artificial neural network
Source: Proc. of the IEEE, Nov. 1998, Y. LeCun et al.
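A minimal sketch (assuming TensorFlow/Keras is available) of a LeNet-like configuration: two convolution layers (each a convolution + activation + pooling sublayer) followed by FC layers. ReLU and max pooling are modern substitutions, and the 32×32 grayscale input and 10-class output are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(6, (5, 5), activation='relu', input_shape=(32, 32, 1)),  # C1: 6 output channels
    MaxPooling2D((2, 2)),                                           # S2: pooling, r = 2
    Conv2D(16, (5, 5), activation='relu'),                          # C3: 16 output channels
    MaxPooling2D((2, 2)),                                           # S4
    Flatten(),
    Dense(120, activation='relu'),                                  # C5 (FC-like)
    Dense(84, activation='relu'),                                   # F6
    Dense(10, activation='softmax')                                 # output
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])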
CNN Model – Convolution layer
 CNN model (convolution layer)
 For convenience, we divide it into 3 sublayers.
 Convolution sublayer
 Activation function sublayer
 Pooling sublayer
 Activation function sublayer is the same as in conventional NN
 Dropout can also be applied as in fully connected NN layer
2019-09-26 185Machine learning and artificial neural network
CNN Model – Convolution layer
 CNN model (convolution layer)
 A conventional NN layer has a 1-dimensional array of neurons,
while a CNN layer has a 3-dimensional array (width, height and
depth), where the depth index is called the “channel”
 The input to a CNN layer is also 3-dimensional, e.g., 2D images
with RGB (3 channels)
 Denote the 3-d input and output of the CNN layer as x_i and
h_j, where i and j are the input and output channel indices.
2019-09-26 186Machine learning and artificial neural network
CNN Model – Convolution layer
 CNN model (convolution layer)
 The operations of the three sublayers are
 Convolution sublayer ------------: g_j = Σ_i W_ji ∗ x_i
 Activation function sublayer ---: a_j = f( g_j )
 Pooling sublayer -----------------: h_j = pool( a_j )
 The input and output sizes are the same only for the AF sublayer.
The other two sublayers have different input and output sizes.
2019-09-26 187Machine learning and artificial neural network
CNN Model – Convolution layer
 Convolution sublayer
 g_j = Σ_i W_ji ∗ x_i
 W_ji is the weight matrix (filter) between the i-th channel of the
input and the j-th channel of the output.
 ∗ is 2-d convolution, with which the (m,n)-th element of g_j is
given by g_j[m,n] = Σ_i Σ_{(p,q)∈R(m,n)} W_ji[p,q] x_i[p,q]
 R(m,n) is the “receptive field” of the (m,n)-th neuron
 (A minimal NumPy sketch of this operation follows the figure below.)
2019-09-26 188Machine learning and artificial neural network
[Figure: the 2-d array of input signal of the i-th input channel is mapped, through a shared filter sliding over receptive fields, to the 2-d array of neurons of the j-th output channel]
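A minimal NumPy sketch of the single-channel case (one input channel, one filter); the multi-channel sublayer sums this over input channels for each output channel. The stride handling and the toy averaging filter are illustrative assumptions.

import numpy as np

def conv2d_single(x, w, stride=1):
    """Correlate one input channel x (H x W) with one filter w (k x k), 'valid' range."""
    H, W = x.shape
    k = w.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    y = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            patch = x[m*stride:m*stride+k, n*stride:n*stride+k]  # receptive field R(m,n)
            y[m, n] = np.sum(patch * w)                          # shared filter weights
    return y

x = np.arange(36.0).reshape(6, 6)
w = np.ones((3, 3)) / 9.0            # a simple 3x3 averaging filter as a toy example
print(conv2d_single(x, w).shape)     # (4, 4)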
CNN Model – Convolution layer
 Convolution sublayer
 Each filter responds to a certain pattern within a
receptive field on the input.
 Filter examples: three filters of size 5x5 responding to
different patterns (diamond, T and diagonal, respectively)
 The filter coefficients are obtained through CNN training
and, in general, they are real-valued.
2019-09-26 189Machine learning and artificial neural network
CNN Model – Convolution layer
 Convolution sublayer example (2 input ch., 3 output ch.)
 All the neurons of a channel share the same weight matrices
 A channel (2D array) is a feature map containing information
of a (combination of) specific pattern(s) defined by weight
matrices; (information on location and existence)
2019-09-26 190Machine learning and artificial neural network
CNN Model – Convolution layer
 Convolution sublayer
 Configuration parameters
• the stride and the size of the 2-d weight matrix (filter)
• the sizes of the 3-d input and output (width × height × channels)
 The stride, filter size and input/output sizes must be set consistently
 The number of weight matrices (filters) to train is
(# of input channels) × (# of output channels)
 In general, the output width and height are no larger than the input’s,
while the number of channels typically grows with depth
2019-09-26 191Machine learning and artificial neural network
CNN Model – Convolution layer
 Activation function sublayer
 a_j = f( g_j )
 The output of the convolution sublayer, g_j, is passed through
an activation function.
 ReLU or leaky ReLU is typically used.
 The output a_j has the same size as the input.
2019-09-26 192Machine learning and artificial neural network
CNN Model – Convolution layer
 Pooling sublayer
 h_j = pool( a_j )
 The pooling sublayer down-samples the sublayer input a_j.
 While doing so, it also summarizes the data.
 Let r be the down-sample ratio. Each channel of the input is
partitioned into r×r areas (pooling areas), in which each r×r
array of numbers is summarized into a scalar.
 Two types: max-pooling (takes the maximum value) and average-
pooling (takes the average of the values)  the output size is
1/r² of the input size
2019-09-26 193Machine learning and artificial neural network
CNN Model – Convolution layer
 Pooling sublayer
 The pooling operation can be expressed as
h_j[m,n] = max_{(p,q)∈P(m,n)} a_j[p,q]   (max pooling), or
h_j[m,n] = (1/r²) Σ_{(p,q)∈P(m,n)} a_j[p,q]   (average pooling)
 P(m,n) is the pooling area of the (m,n)-th output.
 Pooling reduces the computational burden, e.g., with r = 2, the
number of parameters to train downstream is reduced to ¼.
 If r is too large, however, important information can be lost.
 It is better to apply pooling multiple times with a small r
(a minimal NumPy sketch follows this slide).
2019-09-26 194Machine learning and artificial neural network
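A minimal NumPy sketch of max pooling on a single channel, assuming the channel height and width are multiples of the ratio r.

import numpy as np

def max_pool(x, r=2):
    """Down-sample one channel by r using max pooling over non-overlapping r x r areas."""
    H, W = x.shape
    return x.reshape(H // r, r, W // r, r).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x, 2))   # 2x2 output: the maximum of each 2x2 pooling area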
CNN training
 CNN training
 The parameters to optimize are the weight matrices W_ji of every
convolution sublayer
 As in a conventional NN, we apply the chain rule to compute the
gradient w.r.t. each W_ji
 Differences from a conventional NN
1. 3-D (cubic) arrays of neurons
2. partial connection & weight sharing in the conv. sublayer
3. passing the gradient through the pooling sublayer
 See textbook section 11.3 for details
2019-09-26 195Machine learning and artificial neural network
CNN training
 Improving performance of CNN
 Apply dropout to avoid co-adaptation between channels
 Data normalization: adjust mean (brightness) and variance
(contrast) of image to make them fall within predefined ranges
 Batch normalization: normalize data for each batch at each layer
 Data augmentation: increase data set by resizing and/or
rotating the original image  size/rotation invariance
2019-09-26 196Machine learning and artificial neural network
 Practice: ML_practice6_CNN_190820.ipynb
2019-09-26 197Machine learning and artificial neural network
Computer Lab.
Machine Learning and Neural Network
Ch.12/13: Unsupervised learning:
Clustering and data visualization
Seokhyun Yoon, Electronics Eng., Dankook University
Ch.12/13 Clustering and data visualization
 Topics
1. Clustering
 Partitioning (centroid) based clustering: k-means algorithm
 Hierarchical (connectivity based) clustering and dendrogram
 Density based clustering
 Distribution based clustering
2. EM algorithm for Gaussian Mixture Model (Ch.13)
3. Data visualization using non-linear mapping: t-SNE
2019-09-26 199Machine learning and artificial neural network
Clustering and data visualization
 Clustering
 Data without labels: X = {x_i}, i = 1, …, N
 The objective is to divide the data into a set of groups
based on some similarity measure
 Need to devise procedures to efficiently group the data
 Data (distribution) visualization to check clusters
 Typical similarity measures:
 Euclidean distance: d(x_i, x_j) = ‖x_i − x_j‖
 Correlation: ρ(x_i, x_j) = x_iᵀ x_j / (‖x_i‖ ‖x_j‖)
2019-09-26 200Machine learning and artificial neural network
Clustering and data visualization
 Four approaches to clustering
 Partitioning (centroid) based clustering: k-means
 Hierarchical (connectivity based) clustering
 Density based clustering
 Distribution based clustering: Gaussian Mixture Model and
EM algorithm (ch.13)
2019-09-26 201Machine learning and artificial neural network
Partitioning (centroid) based Clustering: k-means
 Partitioning (centroid) based clustering
2019-09-26 202Machine learning and artificial neural network
 The feature space is
partitioned into Voronoi
regions, where each region
is represented by a
centroid.
 Based on the Euclidean distance
measure, the points in a
Voronoi region are those
closest to that centroid
 The k-means (Lloyd) algorithm
searches for the centroids of a
pre-defined number of
regions to partition into.
Partitioning (centroid) based Clustering: k-means
 K-means clustering (Lloyd’s algorithm)
 Input: data {x_i}, i = 1, …, N; K: the number of clusters to find
 Initialization: randomly select K samples and use them as the
centroids c_1, …, c_K
1) Determine the class members S_k:
 Set S_k = ∅ for all k
 For all samples x_i, do
 k* = argmin_{k ∈ {1,…,K}} ‖x_i − c_k‖, then add x_i to S_{k*}
2) Update the centroids:
c_k = (1/|S_k|) Σ_{x ∈ S_k} x (mean of its members)
 Repeat 1) and 2) until the assignment doesn’t change any more
 Output: the centroids c_k and a cluster label for every x_i
(a minimal NumPy sketch follows this slide)
2019-09-26 203Machine learning and artificial neural network
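A minimal NumPy sketch of Lloyd's algorithm above; the toy two-cluster data and the random initialization are illustrative assumptions.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # K random samples
    labels = None
    for _ in range(n_iter):
        # 1) assign each sample to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # assignment no longer changes
        labels = new_labels
        # 2) move each centroid to the mean of its members
        for k in range(K):
            if np.any(labels == k):                 # keep the old centroid if a cluster is empty
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, K=2)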
Partitioning (centroid) based Clustering: k-means
 Partitioning (centroid) based clustering
 The k-means algorithm was originally proposed for vector
quantization
 The clusters found can be quite different from our
expectation, especially when the sizes of the true clusters
are quite different
2019-09-26 204Machine learning and artificial neural network
Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
Hierarchical (connectivity based) clustering
 Hierarchical clustering
 The cluster hierarchy is represented by a dendrogram (a binary
tree representing the similarity between clusters).
 In the tree, a node is a cluster and a leaf node is a sample
 Two approaches to build the dendrogram: top-down (divisive) or
bottom-up (agglomerative)
2019-09-26 205Machine learning and artificial neural network
[Figure: a dendrogram with root, internal and leaf nodes over a heatmap; columns are samples labelled by (BRCA) tumor category, rows are features (gene names)]
Hierarchical (connectivity based) clustering
 Bottom-up (agglomerative) approach
 Initially, each sample is set as a cluster (leaf node) having only
one member.
1) Compute “inter-cluster distances” for every pair of clusters
(nodes without parent).
2) Select the pair with smallest distance and merge them to one.
(add a node in the tree connecting the two nodes)
 Repeat 1) and 2) until only one cluster is left
2019-09-26 206Machine learning and artificial neural network
Source: https://www.researchgate.net/publication/273456906_Cluster_Analysis_to_Understand_Socio-Ecological_Systems_A_Guideline/figures?lo=1
Hierarchical (connectivity based) clustering
 Bottom-up (agglomerative) approach
2019-09-26 207Machine learning and artificial neural network
 Inter-cluster distance:
the distance between two clusters
 It can be defined as the
 minimum (single linkage)
 average (average linkage)
 maximum (complete linkage)
 of the distances between every pair
of members (one from each
cluster); a SciPy sketch follows this slide
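A minimal sketch (assuming SciPy and Matplotlib are available) of agglomerative clustering with a chosen linkage and a dendrogram plot; the toy data are assumptions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # toy data

Z = linkage(X, method='average')                  # 'single', 'average' or 'complete'
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
dendrogram(Z)
plt.show()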
Hierarchical (connectivity based) clustering
 Top-down (divisive) approach
 Initially, we have only one cluster (the root node) having all
the samples as its members.
1) Select the cluster having the highest “intra-cluster distance”
(for example)
2) Apply k-means clustering to divide it into two.
 Repeat 1) and 2) until every cluster has only one member.
 Another name for this is “hierarchical k-means”
2019-09-26 208Machine learning and artificial neural network
Density based clustering
 Density based clustering
 A cluster is defined as a set of samples that lie within a
relatively dense area.
 Clusters are separated by sparse areas.
 Useful when clusters are not centralized (not radially
distributed)
 Two well-known algorithms: DBSCAN and OPTICS
2019-09-26 209Machine learning and artificial neural network
Source: https://untitledtblog.tistory.com/146
Density based clustering
 Density based clustering: DBSCAN
 Two parameters: ε (distance threshold) and minPts (# of points)
 Definition (core point): a point from which there are at
least minPts points within a distance ε.
 First, divide all the points into core and non-core points.
 Assign cluster #s to core points
1) Select a core point x whose cluster is not assigned yet.
2) Find all the core points that can be connected to each other
within a distance ε  assign a cluster # to these core point(s)
3) Repeat 1) and 2) to find all the core-point clusters
 Assign cluster #s to non-core points
1) For each non-core point, find the closest core point within the
distance ε and set its cluster to the cluster # of that core point.
2) If there is no core point within ε, it is simply regarded as an outlier.
(a scikit-learn sketch follows this slide)
2019-09-26 210Machine learning and artificial neural network
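A minimal sketch (assuming scikit-learn is available); eps and min_samples correspond to the two parameters above, and the toy data are assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)   # a cluster # per sample; -1 marks outliers (no core point within eps)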
Distribution based clustering
 Distribution based clustering: Mixture model
 Use a PDF model (with parameters) to approximate the
probability distribution of each cluster
 The data distribution is modelled by a mixture of the PDFs
 A well-known, mathematically tractable one is the Gaussian
mixture model (GMM), in which the data distribution is
modelled by
p(x) = Σ_k π_k N(x; μ_k, C_k),
where k is the cluster index and K is the number of clusters
 The objective is to find the optimal model parameters π_k, μ_k, C_k
for k = 1, …, K that best fit the given data set.
2019-09-26 211Machine learning and artificial neural network
Gaussian mixture model & EM algorithm
 Gaussian mixture model
p(x) = Σ_k p(z = k) p(x | z = k) = Σ_k π_k N(x; μ_k, C_k),
where k is the cluster index, K is the number of clusters, and
z is a latent (hidden) variable indicating the cluster
 The objective is to find the optimal model parameters π_k, μ_k, C_k
for k = 1, …, K that best fit the given data set.
 Issues
 We may use the likelihood as the objective function:
L(θ) = Π_i p(x_i | θ)
(θ consists of the {π_k, μ_k, C_k}’s)
 It is not easy to maximize, as p(x | θ) contains a summation over
the latent variable
2019-09-26 212Machine learning and artificial neural network
Gaussian mixture model & EM algorithm
 EM algorithm (in general)
 Use the conditional likelihood given z, i.e., assume z
(the cluster of each sample x) is fixed
 Define the expected complete-data log-likelihood
Q(θ | θ(t)) = E_{z | X, θ(t)} [ log p(X, z | θ) ]
 With this, we iteratively find Q and θ
 Steps
 Initialize θ(0) and do the following until convergence
1) E-step: compute Q(θ | θ(t)) = E_{z | X, θ(t)} [ log p(X, z | θ) ]
2) M-step: θ(t+1) = argmax_θ Q(θ | θ(t))
2019-09-26 213Machine learning and artificial neural network
Gaussian mixture model & EM algorithm
 EM algorithm for Gaussian mixture model
 Conditional likelihood: p(x | z = k, θ) = N(x; μ_k, C_k)
 Steps
 Input: data {x_i}, i = 1, …, N; K: the number of clusters
 Initialize θ(0) = {π_k(0), μ_k(0), C_k(0)}
 Do the following until convergence
1) E-step (responsibilities):
γ_ik(t) = π_k(t) N(x_i; μ_k(t), C_k(t)) / Σ_j π_j(t) N(x_i; μ_j(t), C_j(t))
2) M-step:
π_k(t+1) = Σ_i γ_ik(t) / Σ_j Σ_i γ_ij(t)
μ_k(t+1) = Σ_i γ_ik(t) x_i / Σ_i γ_ik(t)
C_k(t+1) = Σ_i γ_ik(t) (x_i − μ_k(t+1))(x_i − μ_k(t+1))ᵀ / Σ_i γ_ik(t)
(See textbook section 13.2 for details; a scikit-learn sketch follows this slide)
2019-09-26 214Machine learning and artificial neural network
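A minimal sketch (assuming scikit-learn is available) of EM fitting of a GMM; n_components = K must be fixed a priori, as noted on the next slide, and the toy data are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
print(gmm.weights_)           # mixing coefficients pi_k
print(gmm.means_)             # cluster means mu_k
labels = gmm.predict(X)       # hard cluster assignment
resp = gmm.predict_proba(X)   # E-step responsibilities gamma_ik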
Gaussian mixture model & EM algorithm
 Clustering with GMM
 Note
 The number of clusters K must be fixed a priori.
 Variational EM can find a good value for K implicitly.
 See “C. M. Bishop, Pattern Recognition and Machine
Learning, Springer” for variational EM
2019-09-26 215Machine learning and artificial neural network
Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
Dimension reduction
and data visualization
Non-linear feature dimension reduction: t-SNE
 Data (distribution) visualization
 Data visualization gives us a lot of information about the data:
the shape of its distribution, the number of separable
clusters, and so on.
 One can also check whether clustering was done properly and
whether there are any outliers.
 Linear dimension reduction (PCA) is effective if the number
of clusters or the original feature dimension is small
enough.
 We discuss a non-linear dimension reduction technique,
t-distributed stochastic neighbor embedding (t-SNE).
2019-09-26 217Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Requirement in general
 points close to each other in the original space must also
be close together in the new (low dimensional) space.
 The local structure (manifolds) in the original space is kept
in the new space with as little distortion as possible.
 Characteristics of t-SNE
 It’s a non-linear mapping
 Direct mapping: x in the original space  z in the new space,
obtained by solving an optimization problem
 If some new data is added, we need to perform
optimization again and the new mapping will be different
from the previous one.
 An upgraded version of SNE
2019-09-26 218Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Elements
 Pairwise similarity in the original space: p_ij
 Pairwise similarity in the new space: q_ij
 Cost function: the mismatch between {p_ij} and {q_ij}
 Definition
 Given data points x_1, …, x_N, let z_i be the point-wise
mapping of x_i in the new space.
 p_ij = exp(−‖x_i − x_j‖²/2σ²) / Σ_{k≠l} exp(−‖x_k − x_l‖²/2σ²)
 q_ij = (1 + ‖z_i − z_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖z_k − z_l‖²)⁻¹
(a Gaussian kernel in the original space, a Student-t kernel in the new space)
 Both {p_ij} and {q_ij} are valid PMFs.
2019-09-26 219Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Cost function: Kullback–Leibler divergence (KLD)
 Cost = KLD between {p_ij} and {q_ij}:
C(Z) = KL(P‖Q) = Σ_{i,j: i≠j} p_ij log( p_ij / q_ij )
 C(Z) ≥ 0, with equality iff p_ij = q_ij, holds if {p_ij} and {q_ij} are valid PMFs
 Optimization
 We want to find Z = {z_i} that minimizes C(Z).
 Apply gradient descent, for which the gradient of C(Z)
w.r.t. z_i is given by
∂C(Z)/∂z_i = 4 Σ_j (p_ij − q_ij) (1 + ‖z_i − z_j‖²)⁻¹ (z_i − z_j)
 More tricks were applied (see the original paper)
2019-09-26 220Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Note
∂C(Z)/∂z_i = 4 Σ_j (p_ij − q_ij) (1 + ‖z_i − z_j‖²)⁻¹ (z_i − z_j)
 Let X be the original space and Z be the new space
 The direction of movement of z_i contributed by each j is either
toward z_j or the opposite
 The sign is determined by (p_ij − q_ij), i.e., the movement is toward z_j
if p_ij > q_ij (similarity in Z < that in X, or distance in Z > that in X)
 The actual movement is given by the sum over all j  it makes
{q_ij} and {p_ij} as close as possible
 (1 + ‖z_i − z_j‖²)⁻¹ can be regarded as the rate of movement
 The rate of movement is large if z_i and z_j are close together,
and vice versa  it tries to keep the focus on the local structure
2019-09-26 221Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Comparison: PCA versus t-SNE
 400 dimensional features mapped to 2-dimensional features
2019-09-26 222Machine learning and artificial neural network
Non-linear feature dimension reduction: t-SNE
 Perplexity: setting σ_i
 Perplexity is defined for a point x_i as Perp(p_i) = 2^{H(p_i)},
where H(p_i) = −Σ_j p_{j|i} log₂ p_{j|i}
with p_{j|i} = exp(−‖x_i − x_j‖²/2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖²/2σ_i²)
 We make the perplexity roughly the same for every point, i.e.,
 set σ_i smaller in dense regions (many points nearby)
 set σ_i larger in sparse regions (few points nearby)
 In this way, the effective number of points nearby is
made roughly the same
 Binary search can be used to find σ_i
 A typical value of the perplexity is 5–50
(a scikit-learn sketch follows this slide)
2019-09-26 223Machine learning and artificial neural network
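A minimal sketch (assuming scikit-learn is available); the 50-dimensional toy features and the perplexity value are illustrative assumptions.

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                       # toy high-dimensional features

Z = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(Z.shape)                                       # (200, 2) embedding for plotting
# Note: adding new data requires rerunning the optimization (no direct out-of-sample mapping).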
 Practice: ML_practice7_clustering.ipynb
2019-09-26 224Machine learning and artificial neural network
Computer Lab.

More Related Content

What's hot

Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.
Rohit Kumar
 
Introduction to AI & ML
Introduction to AI & MLIntroduction to AI & ML
Introduction to AI & ML
Mandy Sidana
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Mohammed Bennamoun
 
Perceptron
PerceptronPerceptron
Perceptron
Nagarajan
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
Koundinya Desiraju
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
MachinePulse
 
Deep learning ppt
Deep learning pptDeep learning ppt
Deep learning ppt
BalneSridevi
 
Perceptron & Neural Networks
Perceptron & Neural NetworksPerceptron & Neural Networks
Perceptron & Neural Networks
NAGUR SHAREEF SHAIK
 
Machine Learning by Rj
Machine Learning by RjMachine Learning by Rj
Machine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesMachine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesRui Pedro Paiva
 
Machine learning
Machine learningMachine learning
Machine learning
Dr Geetha Mohan
 
Lecture 9 Perceptron
Lecture 9 PerceptronLecture 9 Perceptron
Lecture 9 Perceptron
Marina Santini
 
Fundamental, An Introduction to Neural Networks
Fundamental, An Introduction to Neural NetworksFundamental, An Introduction to Neural Networks
Fundamental, An Introduction to Neural Networks
Nelson Piedra
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5
Sunwoo Kim
 
Neural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's PerceptronNeural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's Perceptron
Mostafa G. M. Mostafa
 
Summer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonSummer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of London
Yash Khanna
 
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
Edureka!
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
PavanpreetKaur1
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Simplilearn
 
Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.
ASHOK KUMAR
 

What's hot (20)

Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.
 
Introduction to AI & ML
Introduction to AI & MLIntroduction to AI & ML
Introduction to AI & ML
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
 
Perceptron
PerceptronPerceptron
Perceptron
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
Deep learning ppt
Deep learning pptDeep learning ppt
Deep learning ppt
 
Perceptron & Neural Networks
Perceptron & Neural NetworksPerceptron & Neural Networks
Perceptron & Neural Networks
 
Machine Learning by Rj
Machine Learning by RjMachine Learning by Rj
Machine Learning by Rj
 
Machine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesMachine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and Techniques
 
Machine learning
Machine learningMachine learning
Machine learning
 
Lecture 9 Perceptron
Lecture 9 PerceptronLecture 9 Perceptron
Lecture 9 Perceptron
 
Fundamental, An Introduction to Neural Networks
Fundamental, An Introduction to Neural NetworksFundamental, An Introduction to Neural Networks
Fundamental, An Introduction to Neural Networks
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5
 
Neural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's PerceptronNeural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's Perceptron
 
Summer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonSummer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of London
 
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
 
Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.
 

Similar to Machine learning and_neural_network_lecture_slide_ece_dku

Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
Dr. Abdul Ahad Abro
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
IRJET Journal
 
SVM & MLP on Matlab program
 SVM & MLP on Matlab program  SVM & MLP on Matlab program
SVM & MLP on Matlab program
Hussain Ala'a Alkabi
 
# Neural network toolbox
# Neural network toolbox # Neural network toolbox
# Neural network toolbox
VineetKumar508
 
Rachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra
 
Machine learning
 Machine learning Machine learning
Machine learning
Siddharth Kar
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
IRJET Journal
 
Artificial Intelligence based Pattern Recognition
Artificial Intelligence based Pattern RecognitionArtificial Intelligence based Pattern Recognition
Artificial Intelligence based Pattern Recognition
Dr. Amarjeet Singh
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
NitinSharma134320
 
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET Journal
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
VenkateswaraBabuRavi
 
Visualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine LearningVisualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine Learning
IRJET Journal
 
Data Science Machine
Data Science Machine Data Science Machine
Data Science Machine
Luis Taveras EMBA, MS
 
IRJET- Machine Learning
IRJET- Machine LearningIRJET- Machine Learning
IRJET- Machine Learning
IRJET Journal
 
Computational model for artificial learning using formal concept analysis
Computational model for artificial learning using formal concept analysisComputational model for artificial learning using formal concept analysis
Computational model for artificial learning using formal concept analysisAboul Ella Hassanien
 
Machine_Learning_Co__
Machine_Learning_Co__Machine_Learning_Co__
Machine_Learning_Co__
Sitamarhi Institute of Technology
 
Survey on Artificial Neural Network Learning Technique Algorithms
Survey on Artificial Neural Network Learning Technique AlgorithmsSurvey on Artificial Neural Network Learning Technique Algorithms
Survey on Artificial Neural Network Learning Technique Algorithms
IRJET Journal
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
IAEME Publication
 
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET Journal
 

Similar to Machine learning and_neural_network_lecture_slide_ece_dku (20)

Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
SVM & MLP on Matlab program
 SVM & MLP on Matlab program  SVM & MLP on Matlab program
SVM & MLP on Matlab program
 
# Neural network toolbox
# Neural network toolbox # Neural network toolbox
# Neural network toolbox
 
Rachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_report
 
Machine learning
 Machine learning Machine learning
Machine learning
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
 
Artificial Intelligence based Pattern Recognition
Artificial Intelligence based Pattern RecognitionArtificial Intelligence based Pattern Recognition
Artificial Intelligence based Pattern Recognition
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
 
Lecture-6-7.pptx
Lecture-6-7.pptxLecture-6-7.pptx
Lecture-6-7.pptx
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Visualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine LearningVisualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine Learning
 
Data Science Machine
Data Science Machine Data Science Machine
Data Science Machine
 
IRJET- Machine Learning
IRJET- Machine LearningIRJET- Machine Learning
IRJET- Machine Learning
 
Computational model for artificial learning using formal concept analysis
Computational model for artificial learning using formal concept analysisComputational model for artificial learning using formal concept analysis
Computational model for artificial learning using formal concept analysis
 
Machine_Learning_Co__
Machine_Learning_Co__Machine_Learning_Co__
Machine_Learning_Co__
 
Survey on Artificial Neural Network Learning Technique Algorithms
Survey on Artificial Neural Network Learning Technique AlgorithmsSurvey on Artificial Neural Network Learning Technique Algorithms
Survey on Artificial Neural Network Learning Technique Algorithms
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
 
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
 

Recently uploaded

Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
veerababupersonal22
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 

Recently uploaded (20)

Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 

Machine learning and_neural_network_lecture_slide_ece_dku

  • 1. Machine Learning and Neural Network Course introduction Seokhyun Yoon, Electronics Eng., Dankook Uinversity
  • 2. Machine learning: Course introduction  Target audience  Senior undergraduate  First year graduate student  Prerequisite  Linear algebra (선형대수 혹은 공학수학2)  Basic probability and statistics (확률및통계학)  Basic Python programming  Textbook  기계학습과 인공신경망 개론 (Ver1.xx)  Download: https://www.slideshare.net/SeokhyunYoon1/ 2019-09-26 2Machine learning and artificial neural network
  • 3. Machine Learning and Neural Network Ch.1: Introduction Seokhyun Yoon, Electronics Eng., Dankook Uinversity
  • 4. Ch.1 ML Introduction  Objective: get “feel” and terminologies 1. What is machine learning? Concept and applications 2. What problems can ML solve?  Classification, regression and clustering  Supervised and unsupervised learning 3. Key elements of ML  Data, Model and Cost 4. Design steps and issues in performance evaluation 2019-09-26 4Machine learning and artificial neural network
  • 5. Machine learning: Introduction  Major applications  Pattern classification: Character/Speech recognition  Object detection and tracking  Time-series prediction (Stock price/market prediction, weather forecast)  Sentence completion and language translation  … and much more  Problems in machine learning  Classification  Regression  Clustering 2019-09-26 5Machine learning and artificial neural network
  • 6. Machine learning: Introduction Related fields 2019-09-26 6Machine learning and artificial neural network Machine learning Probability and statistics Data science Cognitive science Artificial Intelligence Computer science Big data Data mining Linguistics Psychology, neuro-science Neural Network
  • 7. Machine learning: Introduction  Elements of machine learning in classification and regression problem  Prediction model ( ) with parameters ( )  Data (observations and their target values )  Cost(loss)/Objective function to minimize/maximize  Algorithm to efficiently obtain the optimal or a good solution 2019-09-26 7Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: Cost/loss: Algorithm to solve ∗ 𝜽
  • 8. Machine learning: Introduction  Machine learning process 2019-09-26 8Machine learning and artificial neural network Existing data 𝒊 𝟏 𝑵 Machine learning Algorithm ∗ 𝜽 Model (with parameters) ∗ New data Prediction ∗
  • 9. Machine learning: Introduction  Classification and regression 2019-09-26 9Machine learning and artificial neural network cat (Smiling) X y 1 2 3 4 5 6 1 2 3 X = 2.5, y=? X = 6.0, y=? Existing dataNew data
  • 10. Machine learning: Introduction  Given an observation (which can be a vector, a matrix (image) or a tensor)  Classification determines its class among a set of classes  Regression estimates/predicts unobserved variables  Regression can be a prediction of future trend or interpolation of some missing information  Classification vs. regression  In classification, is a discrete, categorical value drawn from a finite set  In regression, is a numerical value 2019-09-26 10Machine learning and artificial neural network
  • 11. Machine learning: Introduction  Machine learning is all about to find and  How to find the best or, at least, a good ?  Given , how to find the best or, at least, a good ?  The best or a good for what and in what sense ?  Why do we need pre-collected data for learning/training ? 2019-09-26 11Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: Cost/loss: Algorithm to solve ∗ 𝜽
  • 12. Machine learning: Introduction  Some terminologies  Learning/Training/Model fitting: process to find the model parameters ( ) that best fit to given data in terms of the predefined cost/objective  Supervised learning: target values ( ) are provided • Classification, regression  Unsupervised learning: no target values provided • Clustering 2019-09-26 12Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: Cost/loss: Algorithm to solve ∗ 𝜽
  • 13. Machine learning: Introduction  Design steps (supervised learning) 1. Define the function you want to implement (define input and output ) 2. Design your model , intuitively and smartly 3. Collect data and curate them to set 4. Train the model to get ∗ 5. Use ∗ to evaluate the performance 6. If satisfied, you are done! Otherwise, go to step 2 (skip 3).  Step 2 requires strong/some mathematical background  Step 3 is typically time-consuming and sometimes requires domain expertise (e.g. for medical application) 2019-09-26 13Machine learning and artificial neural network
  • 14. Machine learning: Introduction  Design steps for beginner (supervised learning) 1. Choose a function you want to implement (input/output formats are pre-defined) 2. Search for some open SW packages to choose/construct an appropriate model and try to modify slightly 3. Download dataset ( , ) from the internet 4. Use the packages to train the model to get ∗ 5. Use ∗ to evaluate the performance 6. If satisfied, you are done! Otherwise, go to step 2 (skip 3). 2019-09-26 14Machine learning and artificial neural network
  • 15. Machine learning: Introduction  Parameters and hyper parameters  Most of the models have some hyper-parameters that are pre-defined before training  Must be optimized for performance, computing costs …  may need grid search to find the best combination of hyper parameters. 2019-09-26 15Machine learning and artificial neural network
  • 16. Machine learning: Introduction  Performance evaluation of classifier/regressor  Must consider “generalization error”  Typical performance measures  Classification: Accuracy  Regression: Mean Squared Error , R2 measure 2019-09-26 16Machine learning and artificial neural network 그림 1.1 분류기/추정기의 학습과 테스트
  • 17. Machine learning: Introduction  Clustering  No target values for observations  Objective is to divide data into a set of groups based on some similarity measures  Need to devise procedures to efficiently group data  Data (distribution) visualization may help  Once clustered, the data can be used for classification 2019-09-26 17Machine learning and artificial neural network
  • 18. Machine learning: Introduction  Two typical similarity measures  Euclidian distance:  Correlation: 𝒙 𝒙 𝒙 𝒙  Need to consider symmetricity and their ranges  Note  L-p norm of a vector: /  Default value of p = 2   Schwartz’s inequality: 2019-09-26 18Machine learning and artificial neural network
  • 19. Machine learning: Introduction  Simplest classifier: k nearest neighbor (knn) classifier  Training data 𝒊 𝟏 𝑵 used as templates  Given new input data , it determines its class as follows 1. Compute (may use other similarity measure) 2. Select k candidates nearest to 3. Use majority vote to determine the class of 2019-09-26 19Machine learning and artificial neural network Existing data 𝒊 𝟏 𝑵 knn classifierNew data Prediction
  • 20. Machine learning: Introduction  k nearest neighbor (knn) as regressor  Training data 𝒊 𝟏 𝑵 used as templates  Given new input data , it determines its class as follows 1. Compute (may use other similarity measure) 2. Select k candidates nearest to 3. Take average of k candidates to determine the estimates 2019-09-26 20Machine learning and artificial neural network Existing data 𝒊 𝟏 𝑵 knn regressorNew data Prediction
  • 21. Machine Learning and Neural Network Ch.2: Data and descriptive statistics Seokhyun Yoon, Electronics Eng., Dankook Uinversity
  • 22. Ch.2 Data and descriptive statistics  Topics 1. Data: types and representation 2. Descriptive statistics  Scatter plot and histogram  Mean, correlation and covariance 2019-09-26 22Machine learning and artificial neural network
  • 23. Data and descriptive stat.  Terminologies and notation  Observation/sample/feature vector (for now, assume that it is a vector).  Target value : desired value for a sample  In supervised learning, and should be paired ( , )  Collection of data: 2019-09-26 23Machine learning and artificial neural network Each column is a sample each row is a feature
  • 24. Data and descriptive stat.  Two types of data:  Categorical  Numerical  Categorical value is typically mapped to an integer to make it suitable for computation  ex: T → 1, F → 0  Blood type: O → 0, A → 1, B → 2, AB → 3 2019-09-26 24Machine learning and artificial neural network
  • 25. Data and descriptive stat.  An example of multivariate (다변량) data  Data consisting of 20 samples  Each column is one sample with 4 features, (Group, English, Math, Science score)  call it feature vector  where Group is categorical and others are numerical 2019-09-26 25Machine learning and artificial neural network sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD. Group A A A A A A A A A A B B B B B B B B B B English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09 Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09 Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47 a sample/observation
  • 26. Data and descriptive stat.  Example problems  Classification: Given , determine  Regression: Given , estimate 2019-09-26 26Machine learning and artificial neural network sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD. Group A A A A A A A A A A B B B B B B B B B B English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09 Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09 Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47 a sample/observation
  • 27. Data and descriptive stat.  Data visualization: scatter plot and histogram 2019-09-26 27Machine learning and artificial neural network  Empirical (probability) density gives us lots of information for the design and performance of classifiers, regressors and clustering algorithms  One- or two-dimensional (bivariate) data is easy to visualize, while data in more than two dimensions is hard  Pairwise scatter plots are affordable for small M
  • 28. Data and descriptive stat.  Problems in machine learning 2019-09-26 28Machine learning and artificial neural network Classification Regression Clustering Few probability distribution models can be successfully applied to practical datasets. That’s why we resort to machine learning based on a collection of samples
  • 29. Data and descriptive stat.  Mean, Correlation and Covariance  Consider a dataset of N samples and M features  (Per-feature) mean: $\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ji}$  (Per-feature) variance: $\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{ji}-\mu_j)^2$  $\sigma_j$ is the standard deviation  The $\mu_j$’s and $\sigma_j^2$’s can be collectively represented as a vector 2019-09-26 29Machine learning and artificial neural network
  • 30. Data and descriptive stat.  Mean, Correlation and Covariance  Dataset of N samples and M features  Correlation (for a pair of features): $r_{jk} = \frac{1}{N}\sum_{i=1}^{N} x_{ji}x_{ki}$  Covariance (for a pair of features): $c_{jk} = r_{jk} - \mu_j\mu_k$  $r_{jk}=r_{kj}$, $c_{jk}=c_{kj}$ (symmetric)  The $r_{jk}$’s and $c_{jk}$’s can be collectively represented as matrices 2019-09-26 30Machine learning and artificial neural network
  • 31. Data and descriptive stat.  Mean, Correlation matrix and Covariance matrix  Consider a dataset of N samples and M features, collected in the M×N matrix 𝑿  Mean (vector): $\boldsymbol{\mu}_X = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}_i$  Correlation matrix: $\boldsymbol{R}_{XX} = \frac{1}{N}\boldsymbol{X}\boldsymbol{X}^T$  Covariance matrix: $\boldsymbol{C}_{XX} = \boldsymbol{R}_{XX} - \boldsymbol{\mu}_X\boldsymbol{\mu}_X^T$  Cross correlation: $\boldsymbol{R}_{Xy} = \frac{1}{N}\boldsymbol{X}\boldsymbol{y}^T$  Cross covariance: $\boldsymbol{C}_{Xy} = \boldsymbol{R}_{Xy} - \boldsymbol{\mu}_X\mu_y$ 2019-09-26 31Machine learning and artificial neural network Size: M×M (for 𝑹_XX, 𝑪_XX) Size: M×1 (for 𝑹_Xy, 𝑪_Xy)
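A small NumPy sketch of these quantities, assuming the 1/N normalization and the convention that each column of X is a sample (the textbook's convention; np.cov uses 1/(N−1) instead).

```python
import numpy as np

# X: M x N data matrix (each column is a sample, each row a feature)
X = np.array([[77., 81., 74., 89.],
              [72., 67., 74., 64.],
              [75., 68., 72., 68.]])
N = X.shape[1]

mu = X.mean(axis=1, keepdims=True)   # M x 1 mean vector
R_XX = (X @ X.T) / N                 # M x M correlation matrix
C_XX = R_XX - mu @ mu.T              # M x M covariance matrix (C = R - mu mu^T)
print(mu.ravel())
print(C_XX)
```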
  • 32. Data and descriptive stat.  Properties of 𝑪_XX (and 𝑹_XX)  $\boldsymbol{C}_{XX}^T = \boldsymbol{C}_{XX}$ (symmetric)  𝑪_XX is non-negative definite, such that, for any vector 𝒗, $\boldsymbol{v}^T\boldsymbol{C}_{XX}\boldsymbol{v} \ge 0$  The eigen values are all non-negative and their eigen vectors form an orthonormal basis, i.e., with eigen decomposition $\boldsymbol{C}_{XX} = \boldsymbol{E}\boldsymbol{\Lambda}\boldsymbol{E}^T$, the diagonal elements of 𝚲 are all non-negative real and $\boldsymbol{E}^T\boldsymbol{E}=\boldsymbol{I}$  If N < M (the number of samples is less than the number of features), then 𝑪_XX has at most N non-zero eigen values (all others are zero). In this case, 𝑪_XX is not invertible  These properties also hold for 𝑹_XX 2019-09-26 32Machine learning and artificial neural network
  • 33. Data and descriptive stat.  For the two given data matrices,  Find 𝝁_X  Find 𝑹_XX and 𝑪_XX  Check whether 𝑹_XX and 𝑪_XX satisfy the properties in the previous slide. 2019-09-26 33Machine learning and artificial neural network
  • 34. Data and descriptive stat.  Example (Problem 2.2)  Find the correlation and covariance between • English and math • English and science • Math and science  Find 𝑹_XX and 𝑪_XX  Check whether 𝑹_XX and 𝑪_XX satisfy the properties in the previous slide. 2019-09-26 34Machine learning and artificial neural network sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD. English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09 Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09 Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
  • 35. Homework & Computer Lab.  Homework: 2.1, 2.2 2019-09-26 35Machine learning and artificial neural network
  • 36. Machine Learning and Neural Network Ch.3: Multi-variate Gaussian PDF and linear transform Seokhyun Yoon, Electronics Eng., Dankook University
  • 37. Ch.3 Multivariate Gaussian PDF & linear transform  Topics 1. Multi-variate Gaussian PDF  Pearson’s correlation coefficient 2. Linear transformation  Principal axes transform and whitening 3. Principal component analysis (PCA) 2019-09-26 37Machine learning and artificial neural network
  • 38. Multivariate Gaussian PDF: definition  Definition of multivariate Gaussian (Normal) PDF  Consider a Gaussian random vector $\boldsymbol{x}=(x_1,\dots,x_M)^T$  The PDF of 𝒙 is defined, in general, as $p(\boldsymbol{x}) = (2\pi)^{-M/2}|\boldsymbol{C}|^{-1/2}\exp\!\left\{-\tfrac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{C}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right\}$, where 𝝁 is the mean and 𝑪 is the covariance matrix  Note  Quadratic form: $(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{C}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})$ is a scalar  Mahalanobis distance: $d_M(\boldsymbol{x},\boldsymbol{\mu}) = \sqrt{(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{C}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}$ (symmetric) 2019-09-26 38Machine learning and artificial neural network
  • 39. Multivariate Gaussian PDF  3 cases of bivariate Gaussian (Normal) PDF  Case 1: 𝝁 = [0, 5]ᵀ, 𝑪 = [[9, 0], [0, 9]];  Case 2: 𝝁 = [0, 5]ᵀ, 𝑪 = [[1, 0], [0, 16]];  Case 3: 𝝁 = [0, 5]ᵀ, 𝑪 = [[9, −10], [−10, 16]] 2019-09-26 39Machine learning and artificial neural network Mean is just a “translation” (Contour plot)
  • 40. Multivariate Gaussian PDF  Let’s take a closer look  “Contour” can be obtained from 𝑻  Suppose that for simplicity  𝑻  Suppose also that  𝟏 𝟏  Then, we have 𝑧 𝜎 − 2𝜌 𝑧 𝜎 𝑧 𝜎 + 𝑧 𝜎 = 𝑐′(1 − 𝜌 )  where is Pearson correlation coefficient defined as satisfying  We say and are uncorrelated if and has perfect correlation if 2019-09-26 40Machine learning and artificial neural network This is an ellipse
  • 41. Multivariate Gaussian PDF  Examples  The Pearson correlation coefficient between two random variables (two features) $x_1$ and $x_2$ is defined as $\rho = c_{12}/(\sigma_1\sigma_2)$, satisfying $-1 \le \rho \le 1$  We say that $x_1$ and $x_2$ are uncorrelated if $\rho = 0$ and have perfect correlation if $|\rho| = 1$ 2019-09-26 41Machine learning and artificial neural network
  • 42. Data and descriptive stat.  What can you see? 2019-09-26 42Machine learning and artificial neural network  Are Math and English scores correlated ?  What can you say about Math and English score? Set up your hypothesis.  Use the figure in the previous page to roughly estimate the Pearson correlation coefficient.
  • 43. Multivariate Gaussian PDF (note)  Marginalization of an M-variate Gaussian PDF is also a Gaussian PDF with (M−1) variates  Successive marginalization gives us a univariate Gaussian PDF 2019-09-26 43Machine learning and artificial neural network
  • 44. Linear transform  Definition of a linear transformation  For any matrix of size (KxM), linear transform of a vector of size (Mx1) is defined as  Linear transform is a projection of onto the row space of  Linear transform of a Gaussian random vector  Suppose that be a Gaussian RV with mean and cov. , i.e.,  Then, for any matrix , the linear transform is also Gaussian with mean and covariance , i.e.,  Try to verify using the def. of mean and covariance in Ch.2 2019-09-26 44Machine learning and artificial neural network
  • 45. Linear transform  Principal axes transformation and Whitening  Suppose that (eigen-decomposition of ) , : diagonal matrix with ( th eigen value) : eigen basis ( th column is the eigen vector for )  (Principal axes transform) The linear transform by using as transform matrix, is Gaussian with PDF  (Whitening) By using / as transform matrix, / is also Gaussian with PDF / 2019-09-26 45Machine learning and artificial neural network
  • 46. Principal Component Analysis (PCA)  Principal component analysis (PCA)  With  PCA uses several (typically two) eigen vectors corresponding to the largest eigen values as projection matrix.  Let • ( , ) be the two largest eigen values • ( , ) be the corresponding eigen vectors  We use as transform matrix  The distribution of can be easily visualized in a low dimensional (e.g., 2D) space.  If 𝑪 , contains most of the information on , i.e., 2019-09-26 46Machine learning and artificial neural network
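A minimal NumPy sketch of PCA as described above: eigen-decomposition of the covariance matrix followed by projection onto the two leading eigenvectors. Function and variable names are illustrative, not from the textbook.

```python
import numpy as np

def pca_2d(X):
    """Project M x N data (columns = samples) onto its two largest principal axes."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                  # center the data
    C = (Xc @ Xc.T) / X.shape[1]                 # M x M covariance matrix
    eigval, eigvec = np.linalg.eigh(C)           # ascending eigenvalues, orthonormal eigenvectors
    W = eigvec[:, ::-1][:, :2]                   # two eigenvectors with the largest eigenvalues (M x 2)
    return W.T @ Xc                              # 2 x N projected data

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 100))
Z = pca_2d(X)
print(Z.shape)   # (2, 100)
```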
  • 47. Data (distribution) visualization  Pairwise scatter plot is NOT affordable for large M 2019-09-26 47Machine learning and artificial neural network M = 4 M = 64 (showing only 10 features)
  • 48. Data (distribution) visualization 2019-09-26 48Machine learning and artificial neural network Pair-wise scatter plots of the Iris dataset (3 classes, 4-dimensional features). A 2-dimensional projection provides a better representation of the clusters and the similarity between features
  • 49. Data (distribution) visualization 2019-09-26 49Machine learning and artificial neural network Pair-wise scatter plots of the Digits dataset (10 classes, 64-dimensional features), showing only the first 10x10. A 2-dimensional projection provides a better representation of the clusters and the similarity between features
  • 50. Homework & Computer Lab.  Homework: 3.1~3.6  Practice: ML_practice0_ch3_data_visualization_190817c.ipynb 2019-09-26 50Machine learning and artificial neural network
  • 51. Machine Learning and Neural Network Appendix A: Optimization I Seokhyun Yoon, Electronics Eng., Dankook University
  • 52. Appendix: Optimization  Topics 1. Optimization I: Unconstrained optimization  Definition of optimization problem  Quadratic programming problem  Maximum likelihood estimation as an optimization problem 2. Optimization II: Iterative solutions  Gradient descent and stochastic gradient descent  Coordinate descent  Newton-Raphson method 3. Optimization III: Constrained optimization  Definition  Lagrange multiplier and Rayleigh quotient optimization  Duality in constrained optimization and KKT condition 2019-09-26 52Machine learning and artificial neural network
  • 53. Unconstrained optimization  Definitions of unconstrained optimization  Minimization: $\min_{\boldsymbol{\theta}\in\mathbb{R}^M} J(\boldsymbol{\theta})$ or $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$  Maximization: $\max_{\boldsymbol{\theta}\in\mathbb{R}^M} J(\boldsymbol{\theta})$ or $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$, where $J(\boldsymbol{\theta})$ is a cost/objective function.  Convex optimization  If $J(\boldsymbol{\theta})$ is a convex function, the solution can be obtained by solving $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \boldsymbol{0}$ (as there is only one minimum (maximum)), where $\nabla_{\boldsymbol{\theta}}$ is the gradient operator 2019-09-26 53Machine learning and artificial neural network
  • 54. Unconstrained optimization: QP problem  Quadratic programming (QP) problem  QP problem is a special case of convex optimization problem  is a quadratic function of , i.e., 𝜽  Since is a convex function, the solution is given by solving  Solution: ∗ 𝜽 (if is invertible) 2019-09-26 54Machine learning and artificial neural network
  • 55. Unconstrained optimization: Gradient formula  Gradient operators  For vector : 𝜽 𝜽  For matrix : 𝑨 𝑨  Gradient formula  𝜽 𝜽  𝜽  𝑨  𝑨 𝟏 2019-09-26 55Machine learning and artificial neural network
  • 56. Unconstrained optimization: Gradient formula  Example (Problem A.1):  Minimize the given function, i.e., find the minimizing point and also the minimum value  Express the function in vector-matrix form  Use the vector-matrix form to minimize it (use the gradient formula)  Repeat for the other given function 2019-09-26 56Machine learning and artificial neural network
  • 57. Maximum likelihood estimation  Given  Data samples:  PDF model: with unknown parameter  We want to find that maximize  likelihood of :  Or log-likelihood:  It is a maximization problem ∗ 𝜽∈ℝ 𝜽∈ℝ 2019-09-26 57Machine learning and artificial neural network
  • 58. MLE example: Bernoulli trial  Given  Data samples: $x_1,\dots,x_N$, where $x_i\in\{0,1\}$  PDF model: $p(x)=q^x(1-q)^{1-x}$ with parameter $q$  Parameter to estimate: $q$  Likelihood function: $L(q)=\prod_{i=1}^{N} q^{x_i}(1-q)^{1-x_i}$  Solution: $q^* = k/N$, where k is the number of 1’s that occurred in N trials 2019-09-26 58Machine learning and artificial neural network Try to verify this by maximizing the likelihood or log-likelihood function.
  • 59. MLE example: Multi-variate Gaussian PDF (optional)  Given  Data samples: $\boldsymbol{x}_1,\dots,\boldsymbol{x}_N$  PDF model: multivariate Gaussian with 𝝁: mean, 𝑪: covariance matrix  Parameters to estimate: 𝝁 and 𝑪  Log-likelihood function: $\ell(\boldsymbol{\mu},\boldsymbol{C})=\sum_i \log p(\boldsymbol{x}_i;\boldsymbol{\mu},\boldsymbol{C})$  Solution:  $\boldsymbol{\mu}^*=\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}_i$  $\boldsymbol{C}^*=\frac{1}{N}\sum_{i=1}^{N}(\boldsymbol{x}_i-\boldsymbol{\mu}^*)(\boldsymbol{x}_i-\boldsymbol{\mu}^*)^T$ 2019-09-26 59Machine learning and artificial neural network Try to verify this using the gradient formula.
  • 60. Seokhyun Yoon, Electronics Eng., Dankook University Machine Learning and Neural Network Ch.4: Regression
  • 61. Roadmap 2019-09-26 61Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 62. Ch.4 Regression  Topics 1. Linear regression 2. Vector-matrix representation of linear regression 3. Linear prediction 4. Non-linear regression and overfit 5. Performance evaluation: cross-validation 2019-09-26 62Machine learning and artificial neural network
  • 63. Regression  Elements of regression problem  Prediction model ( ) with parameters ( )  Data (observations and their target values )  Cost(loss)/Objective function to minimize/maximize  Algorithm to efficiently obtain the optimal or a good solution 2019-09-26 63Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: Cost/loss: Algorithm to solve ∗ 𝜽
  • 64. Regression: Linear regression  A simple example of linear regression  Data: where  Model: where parameter  Problem is to find the best for given  Best in what sense ? 2019-09-26 64Machine learning and artificial neural network x y (xi, yi)
  • 65. Regression: Linear regression  Least squares solution (최소제곱법)  We want to minimize the residual sum of squares (RSS)  Define error:  Minimize: ;𝜽  where, is a quadratic (convex) function of and  Can use 𝜽 to find and in terms of 2019-09-26 65Machine learning and artificial neural network
  • 66. Regression: Linear regression  Generalization to multi-variate data  Data: where ,  Model: where parameter  Cost function: Residual sum of squares (RSS)  where  ;𝜽  Problem is to find ∗ 𝜽∈ℝ 2019-09-26 66Machine learning and artificial neural network
  • 67. Regression: Model structure  Model and its training at a glance 2019-09-26 67Machine learning and artificial neural network
  • 68. Regression: Linear regression  Solution  is a quadratic function of ’s (convex function)  Can use 𝜽 to obtain a system of equations  Then, solve the system of equations to get ∗ : :  Equivalently, in vector-matrix form, 𝑿 𝑿 𝑿 𝒚 where 𝑻 , 𝑿 𝑿 , 𝑿 𝒚 2019-09-26 68Machine learning and artificial neural network
  • 69. Regression: Vector matrix notation  Vector-matrix notation  Data: where ,  Model: where ,  Cost function: Residual sum of squares (RSS)  Error vector:   𝑿 𝑿 𝑿 𝒚 2019-09-26 69Machine learning and artificial neural network where 𝑿 = 𝟏 𝑻 𝑿 = 1 1 𝑥 𝑥 … 1 ⋯ 𝑥 ⋮ ⋮ 𝑥 𝑥 ⋱ ⋮ ⋯ 𝑥
  • 70. Regression: Vector matrix notation  Vector-matrix notation  Problem is to find the solution of 𝜽 , which is 𝜽 𝑿 𝑿 𝑿 𝒚 𝑿 𝑿 𝑿 𝒚  Solution: ∗ 𝑿 𝑿 𝑿 𝒚  Unique solution exists only if 𝑿 𝑿 is invertible! 2019-09-26 70Machine learning and artificial neural network
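A minimal NumPy illustration of the least-squares solution above. For brevity it uses the common rows-as-samples layout, so the normal equation reads (XᵀX)θ = Xᵀy; the data values are made up for the example.

```python
import numpy as np

# toy 1-D example: y ≈ theta0 + theta1 * x
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

X = np.column_stack([np.ones_like(x), x])   # N x 2 design matrix with intercept column
theta = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) theta = X^T y
print(theta)                                # roughly [1.1, 1.96]
y_hat = X @ theta                           # predictions on the training inputs
```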
  • 71. Regression: Linear regression example  Example  We want to estimate the English score using two models: (English score) = θ₀ + θ₁·(Math score), and (English score) = θ₀ + θ₁·(Math score) + θ₂·(Science score)  Find (θ₀, θ₁) and (θ₀, θ₁, θ₂), respectively (you may use the results of Problem 2.2)  Homework: finish problems 4.1 and 4.2 2019-09-26 71Machine learning and artificial neural network sid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ave. SD. English 77 81 74 89 78 77 75 84 79 82 69 76 74 67 74 71 67 67 70 69 75 6.09 Math 72 67 74 64 71 67 72 68 75 70 83 78 76 82 80 77 83 84 80 80 75 6.09 Science 75 68 72 68 74 68 75 73 79 72 78 75 72 74 77 70 75 79 77 76 74 3.47
  • 72. Regression: Linear prediction  Linear prediction  Given time series data  Use p previous samples to predict the next sample, i.e., we want to predict using ( )  Model: ( )  Example 4.3 2019-09-26 72Machine learning and artificial neural network 𝑡 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 𝑥 -4 -3 14 8 1 -5 -7 -4 -2 6 10 22 15 -15 -20 ?
  • 73. Regression: Linear prediction  Linear prediction  Target value:  Data matrix:  Model: ( ) (no intercept) :  Solution: ∗ 𝜽 𝑿𝑿 𝟏 𝑿𝒚  Prediction: ∗ ∗ ( )  Note: 𝑿𝑿 is a Toeplitz matrix 2019-09-26 73Machine learning and artificial neural network
  • 74. Regression: Linear prediction  Homework: Example 4.3 1) For the given prediction order p, express 𝒚 and 𝑿 and find 𝑿𝑿ᵀ and 𝑿𝒚. 2) Find the linear predictor parameters 𝜽* and predict the next sample. 3) Compute the mean squared error (N=14). 4) Repeat (1)–(3) for the other given order. 5) Compute the variance … of the time-series data (where …), and find … for … . 6) Briefly compare and discuss the results of (5). 2019-09-26 74Machine learning and artificial neural network 𝑡 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 𝑥 -4 -3 14 8 1 -5 -7 -4 -2 6 10 22 15 -15 -20 ?
  • 75. Regression: Non-linear model and overfit  Example of Non-linear regression  Two-feature data  Non-linear model: where  Defining , RSS cost gives us ∗ 𝑿 𝑿 𝑿 𝒚  Note  The model is non-linear in ’s, but linear in ’s  RSS cost function gives us a linear system of equations 2019-09-26 75Machine learning and artificial neural network
  • 76. Regression: Non-linear model and overfit  Considerations for non-linear regression  If the model is a non-linear function of the parameters, the problem (finding the solution) becomes complicated.  A non-linear model is subject to overfit (large generalization error), especially when the number of samples is relatively small compared to the number of parameters in the model.  We need to check whether the model is overfitted to the data or not. 2019-09-26 76Machine learning and artificial neural network source: https://slideplayer.com/slide/6825533/
  • 77. Regression: Non-linear model and overfit  Overfit, underfit and just(appropriate) fit 2019-09-26 77Machine learning and artificial neural network source: https://slideplayer.com/slide/6825533/ source : https://towardsdatascience.com/underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6fe4a8a49dbf
  • 78. Regression: Non-linear model and overfit  How to check whether the model is overfitted or not  If the model is overfitted, the generalization error is much (?) larger than the minimized cost on the training data, where 𝜽* was obtained based on the training data only  That’s why we divide the data (samples) into training and test sets for performance evaluation  A more systematic approach to test overfit: cross validation 2019-09-26 78Machine learning and artificial neural network
  • 79. Regression: Non-linear model and overfit  L-fold cross-validation (교차 검증) 1. We divide the entire data (of N samples) into L groups (of N/L samples per group) 2. Select one group for test and use all others for training 3. Measure ∗ and ∗ 4. Repeat 2 and 3 for each group and take average on both measures 5. Check if ∗ ∗ 2019-09-26 79Machine learning and artificial neural network
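A generic Python sketch of the L-fold procedure above; fit and mse are placeholders for any model-fitting and error functions (here the least-squares fit from Ch.4 is plugged in as an example, and rows of X are samples).

```python
import numpy as np

def l_fold_cv(X, y, fit, mse, L=5):
    """Average training and test MSE over L folds (X: N x M, rows = samples)."""
    N = X.shape[0]
    idx = np.random.permutation(N)
    folds = np.array_split(idx, L)
    train_err, test_err = [], []
    for k in range(L):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(L) if j != k])
        theta = fit(X[train], y[train])
        train_err.append(mse(X[train], y[train], theta))
        test_err.append(mse(X[test], y[test], theta))
    return np.mean(train_err), np.mean(test_err)

# example: plain least squares and its RSS-per-sample error
fit = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)
mse = lambda X, y, th: np.mean((y - X @ th) ** 2)
# avg_train, avg_test = l_fold_cv(X, y, fit, mse, L=5)
```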
  • 80. Homework & Computer Lab.  Homework: 4.1, 4.2, 4.3  Computer Lab: ML_practice1_regression_ex_190820.ipynb 2019-09-26 80Machine learning and artificial neural network
  • 81. Seokhyun Yoon, Electronics Eng., Dankook University Machine Learning and Neural Network Ch.5: Regularization
  • 82. Roadmap 2019-09-26 82Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 83. Ch.5 Regularization  Topics 1. Ridge regression 2. LASSO regression 3. Elastic-net 2019-09-26 83Machine learning and artificial neural network
  • 84. Regularization  Recall linear regression  Data: where ,  Model: where ,  Cost function: Residual sum of squares (RSS)  where  𝑿 𝑿 𝑿 𝒚  Problem is to find the solution of 𝜽 , which is 𝜽 𝑿 𝑿 𝑿 𝒚  ∗ 𝑿 𝑿 𝑿 𝒚  Unique solution exists if 𝑿 𝑿 is invertible! What if it is NOT? 2019-09-26 84Machine learning and artificial neural network
  • 85. Regularization  In what case is 𝑿ᵀ𝑿 NOT invertible?  It is not if N < M, i.e., when the number of samples is less than the number of features (e.g., as in bioinformatics and medical applications)  An infinite number of solutions exist  The model parameters and performance can be highly variable with small changes in the data (overfit)  Two possible approaches  Increasing the sample size (noise injection)  Reducing the feature dimension (selecting good features) 2019-09-26 85Machine learning and artificial neural network
  • 86. Regularization  Increasing sample size (noise injection)  One can double the number of samples by generating new set of data where is random noise matrix with covariance , i.e., ,  Then, use as new data  Note that 𝑿 𝑿 𝑿 𝑿 , which is now invertible “anyway” if 2N > M  It is effectively a “noise injection”  generalization error can be reduced to some extent  If needed, one can add more with different random noise.  The noise variance must be chosen carefully.  Note: the distribution of may not model well the true distribution of . 2019-09-26 86Machine learning and artificial neural network
  • 87. Regularization  Reducing feature dimension (selecting features)  One can select M’ (<N) features, for example, having highest covariance with target value y.  However, this does not guarantee a better performance.  An efficient feature selection method (LASSO) will be discussed shortly 2019-09-26 87Machine learning and artificial neural network
  • 88. Regularization: Ridge and LASSO  Ridge and LASSO regression: RSS + L2/L1 Penalty  Ridge: $J(\boldsymbol{\theta}) = \mathrm{RSS}(\boldsymbol{\theta}) + \lambda\|\boldsymbol{\theta}\|_2^2$  LASSO: $J(\boldsymbol{\theta}) = \mathrm{RSS}(\boldsymbol{\theta}) + \lambda\|\boldsymbol{\theta}\|_1$  Lp-norm: $\|\boldsymbol{\theta}\|_p = \left(\sum_k |\theta_k|^p\right)^{1/p}$  λ controls the relative weight between RSS and penalty  Elastic net: RSS + L1 + L2 Penalty  $J(\boldsymbol{\theta}) = \mathrm{RSS}(\boldsymbol{\theta}) + \lambda_1\|\boldsymbol{\theta}\|_1 + \lambda_2\|\boldsymbol{\theta}\|_2^2$ 2019-09-26 88Machine learning and artificial neural network
  • 89. Regularization: What is the impact of penalty?  Ridge regression  Ridge regression is simply a QP problem  And the solution is $\boldsymbol{\theta}^* = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}^T\boldsymbol{y}$  $\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I}$ is invertible with $\lambda > 0$, even if $\boldsymbol{X}^T\boldsymbol{X}$ is not (see Problem 6.3)  It is effectively a “noise injection” (an increase of the sample size)  and the generalization error can be reduced to some extent 2019-09-26 89Machine learning and artificial neural network
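A one-function NumPy sketch of the ridge closed-form solution above; for simplicity the intercept is penalized together with the other parameters, which practical implementations usually avoid, and rows of X are samples.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution theta* = (X^T X + lam*I)^(-1) X^T y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
```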
  • 90. Regularization: What is the impact of penalty?  LASSO regression  LASSO stands for Least Absolute Shrinkage and Selection Operator  It tends to select the features that describe the target value y well  some θ's vanish if the corresponding features don’t have strong correlation with y  LASSO effectively reduces M, rather than increasing N. 2019-09-26 90Machine learning and artificial neural network
  • 91. Regularization: What is the impact of penalty?  Further remarks on LASSO regression  λ controls sparsity (a high λ selects fewer features)  LASSO tends to select one feature from a group of highly correlated variables (features) and ignore the rest.  Unlike the L2-penalty, the L1-penalty is not differentiable at 0  LASSO regression is a convex optimization problem, but NOT a simple QP problem  use an iterative algorithm to find the solution, especially when M>N (the coordinate descent algorithm to be discussed next)  See the textbook for the coordinate descent algorithm for LASSO 2019-09-26 91Machine learning and artificial neural network
  • 92. Regularization: Elastic-net  Elastic-net  Elastic-net combines the L1 and L2 penalties  The L1-penalty selects features (generating a sparse model)  The L2-penalty reduces the generalization error and also encourages grouping effects. 2019-09-26 92Machine learning and artificial neural network Homework & Computer Lab.  Homework: 6.2, 6.3  Computer lab: ML_practice1_regression_ex_190820.ipynb
  • 93. Machine Learning and Neural Network Appendix C: Optimization III Seokhyun Yoon, Electronics Eng., Dankook University
  • 94. Appendix: Optimization  Topics 1. Optimization I: Unconstrained optimization  Definition of optimization problem  Quadratic programming problem  Maximum likelihood estimation as an optimization problem 2. Optimization II: Iterative solutions  Gradient descent and stochastic gradient descent  Coordinate descent  Newton-Raphson method 3. Optimization III: Constrained optimization  Definition  Lagrange multiplier and Rayleigh quotient optimization  Duality in constrained optimization and KKT condition 2019-09-26 94Machine learning and artificial neural network
  • 95. Unconstrained optimization  Definitions of unconstrained optimization  Minimization: $\min_{\boldsymbol{\theta}\in\mathbb{R}^M} J(\boldsymbol{\theta})$ or $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$  Maximization: $\max_{\boldsymbol{\theta}\in\mathbb{R}^M} J(\boldsymbol{\theta})$ or $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$, where $J(\boldsymbol{\theta})$ is a cost/objective function.  Convex optimization  If $J(\boldsymbol{\theta})$ is a convex function, the solution can be obtained by solving $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \boldsymbol{0}$ (as there is only one minimum (maximum))  Sometimes, however, one cannot get a closed-form solution.  What can we do, then? 2019-09-26 95Machine learning and artificial neural network
  • 96. Iterative search for minimum/maximum  One idea: gradient search  Gradient descent  Hill climbing  Steps  Given cost function J()  Initialize n = 0, (n) = 0  Loop (epoch): 1. Compute gradient at the current position, 𝜽 𝜽 𝜽( ) 2. Update param., ( ) ( ) 3. n  n+1 4. Repeat 1~3 until convergence 2019-09-26 96Machine learning and artificial neural network  𝜂: Learning rate, 0 < 𝜂 ≪ 1  Small enough 𝜂 ensures that  Large 𝜂: Fast convergence, but high MSE due to bouncing  Small 𝜂: Slow convergence, while lower MSE 𝐽 𝜽 ≥ 𝐽 𝜽
  • 97. Iterative search for minimum/maximum  Stochastic gradient descent (SGD)  The cost is typically a sum of per-sample costs  Update for every sample  Steps  Initialize 𝜽 = 0  Outer loop (epoch): for n = 1,2,… • Inner loop: for i = 1,2,…,N (number of samples), update 𝜽 using the gradient of the per-sample cost • Repeat the inner loop until convergence 2019-09-26 97Machine learning and artificial neural network
  • 98. Iterative search for minimum/maximum  In linear regression   𝜽 (gradient of per-sample cost)  SGD for linear regression  Initialize  = 0, n=0  Outer Loop (epoch): for n = 1,2,… • Inner loop: for i = 1,2,…,N (number of samples) 𝑒 ( ) = 𝑦 − 𝒙 𝜽( ) ( ) ( ) ( ) • Repeat inner loop until convergence 2019-09-26 98Machine learning and artificial neural network
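A NumPy sketch of the per-sample SGD update above for linear regression; the hyper-parameter values are arbitrary and the function name is illustrative.

```python
import numpy as np

def sgd_linreg(X, y, eta=0.01, epochs=100):
    """Per-sample SGD for linear regression (X: N x M, rows = samples)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                         # outer loop over epochs
        for i in np.random.permutation(len(y)):     # inner loop over samples
            e = y[i] - X[i] @ theta                 # per-sample prediction error
            theta += eta * e * X[i]                 # gradient step on the per-sample cost
    return theta
```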
  • 99. Iterative search for minimum/maximum  Using momentum  In SGD, if each sample contains “noise”, it disturbs the algorithm, i.e., parameter may move to incorrect direction  It can be alleviated using momentum ( ) ( ) 𝜽 ( ) ( ) ( ) where 2019-09-26 99Machine learning and artificial neural network
  • 100. Iterative search for minimum/maximum  Coordinate descent  Rather than to update every parameters at a time  Update parameters one by one (one coordinate at a time) 𝜃 = argmin 𝐽 𝜃 , 𝜃 , … , 𝜃 , 𝜃 , 𝜃 , … , 𝜃  is given by the solution of equation 𝐽(𝜽) 𝜽𝒌 [ , ,…, , ,…, ] = 0  Simpler implementation 𝜃 = argmin 𝐽 𝜃 , 𝜃 , … , 𝜃 , 𝜃 , 𝜃 , … , 𝜃 2019-09-26 100Machine learning and artificial neural network
  • 101. Iterative search for minimum/maximum  Coordinate descent for linear regression  Cost: 𝟐    With , and , , we have , , ,  Update rule: ( ) , , ( ) ,  Homework: C.1 2019-09-26 101Machine learning and artificial neural network
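A NumPy sketch of coordinate descent for the plain RSS cost: each coordinate is set to the exact minimizer of the cost with all other coordinates fixed. (The LASSO version in the textbook adds a soft-thresholding step; this sketch omits it.)

```python
import numpy as np

def coordinate_descent_linreg(X, y, epochs=50):
    """Update one coordinate of theta at a time by minimizing the RSS in that coordinate."""
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(epochs):
        for k in range(M):
            # residual excluding feature k's current contribution
            r = y - X @ theta + X[:, k] * theta[k]
            theta[k] = (X[:, k] @ r) / (X[:, k] @ X[:, k])
    return theta
```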
  • 102. Machine Learning and Neural Network Ch.6: Classification Seokhyun Yoon, Electronics Eng., Dankook University
  • 103. Ch.6 Classification: problem formulation  Topics 1. Bayesian approach 2. Bayesian approach under Gaussian assumption  Decision boundary 3. Linear model as a special case 2019-09-26 103Machine learning and artificial neural network
  • 104. Classification: Problem formulation  Data  Data: where ,  where is a set of categories(classes)  ’s are categorical and discrete  Bayesian approach: probabilistic model  Assume each class (kth class) is distributed ~ p(x|Hk).  Given new data x, decide its class y as ∈  i.e., select class index for which the conditional probability of x is maximum 2019-09-26 104Machine learning and artificial neural network
  • 105. Classification: Bayesian approach  Binary classification  Assume binary classification (for simplicity), i.e.,  Given new data x, decide its class y by comparing log- likelihood  Binary classification under Gaussian assumption  Assume with parameter and .  Then, we have 𝑻 𝑻 2019-09-26 105Machine learning and artificial neural network
  • 106. Classification: Bayesian approach  Binary classification under Gaussian assumption  Suppose that . Then, we have 𝑻 𝑻  i.e., compare (Mahalanobis) distances of x from class centers 2019-09-26 106Machine learning and artificial neural network 𝑝 𝒙|𝐻 𝑝 𝒙|𝐻
  • 107. Classification: Decision boundary  Decision boundary  It is a “surface” where , i.e., 𝑻 𝑻  It can be written as 𝑻 𝑻  where (a vector) (a scalar)  The decision boundary is given by “conic section”  which can be an hyperbola, an ellipse or a (hyper) plane 2019-09-26 107Machine learning and artificial neural network
  • 108. Classification: Linear model  Linear model for binary classification  Suppose further that .  Then, the decision boundary becomes 𝑻 𝑻 𝑻  which is a (hyper) plane  And the decision rule becomes 𝑻 or equivalently, 𝑻  Model parameter: and (intercept)  Linear classifier partitions ( ? ) into non-overlapping areas using ( ? ) 2019-09-26 108Machine learning and artificial neural network
  • 109. Classification: Linear model vs. Bayesian approach  Bayesian classifier versus linear classifier 2019-09-26 109Machine learning and artificial neural network 𝑝 𝒙|𝐻 𝑝 𝒙|𝐻 𝜽 𝑻 𝒙 + 𝜃 = 0
  • 110. Classification: Summary  Binary classification: summary  Bayesian approach  Under Gaussian assumption (with ) 𝑻 𝑻  With , we get linear model 𝑻 or equivalently, 𝑻 2019-09-26 110Machine learning and artificial neural network Our main focus is on this linear model
  • 111. Classification: Naive implementation  Naive implementation  Given data: where ,  is a set of categories (classes)  ’s are categorical and discrete, e.g.,  Divide data into and (for each class)  Compute for  Use ’s for classification  This is not our focus, though. 2019-09-26 111Machine learning and artificial neural network
  • 112. Classification: Roadmap  Based on the model,  Ch.7: We will develop training (learning) rule, where we obtain and directly from data by solving an optimization problem  Ch.8: The linear model will be extended for multinomial classification problem  Ch.9: The model will be further extended to get neural network model : 2019-09-26 112Machine learning and artificial neural network
  • 113. Homework & Computer Lab.  Homework: 5.1, 5.2, 5.3 2019-09-26 113Machine learning and artificial neural network
  • 114. Machine Learning and Neural Network Ch.7: Logistic Regression (binary classification) Seokhyun Yoon, Electronics Eng., Dankook University
  • 115. Roadmap 2019-09-26 115Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 116. Ch.7 Logistic Regression for binary classification  Topics 1. Logistic regression:  Model with logistic sigmoid function 2. Parameter optimization:  Likelihood function as an objective function  Application of gradient search algorithm 3. Performance measures of binary classifier  Confusion matrix, True Positive and False negative  Accuracy, Sensitivity, Specificity  ROC and AUC 2019-09-26 116Machine learning and artificial neural network
  • 117. Logistic regression: Model  Recall (generalized) linear model for binary classification,  It is a linear regressor if  It is a linear classifier if  It is a logistic regressor if ( ) 2019-09-26 117Machine learning and artificial neural network
  • 118. Logistic regression: Model  Interpretation of the logistic regression model $\hat{y} = \sigma(\boldsymbol{\theta}^T\boldsymbol{x} + \theta_0)$, where $\sigma(z) = 1/(1+e^{-z})$  $\hat{y}$ can be regarded as $\Pr\{y=1|\boldsymbol{x}\}$, so that $\Pr\{y=1|\boldsymbol{x}\} = e^{\boldsymbol{\theta}^T\boldsymbol{x}+\theta_0}/(1+e^{\boldsymbol{\theta}^T\boldsymbol{x}+\theta_0})$ and $\Pr\{y=0|\boldsymbol{x}\} = 1/(1+e^{\boldsymbol{\theta}^T\boldsymbol{x}+\theta_0})$  $\hat{y}$ can also be interpreted as a “class estimate”.  In both cases, if $\hat{y} > 0.5$, 𝒙 is likely to be class 1; otherwise class 0.  $\boldsymbol{\theta}^T\boldsymbol{x} + \theta_0$ is called the “odds” of being class 1 (more precisely, the log-odds). 2019-09-26 118Machine learning and artificial neural network
  • 119. Logistic regression  Geometrical interpretation 2019-09-26 119Machine learning and artificial neural network Decision boundary: 𝜽ᵀ𝒙 + θ₀ = 0; Decision variable 𝑧 = 𝜽ᵀ𝒙 + θ₀ (odds of 𝒙 belonging to class 1); Class 1 / Class 0
  • 120. Logistic regression: Cost function  Cost function: Negative log-likelihood  𝑻 can be interpreted as probability (likelihood) that belongs to class 1.  Likelihood that belongs to the target class is given by  Log-likelihood as an “objective” to maximize  Can also be formulated as minimization of 2019-09-26 120Machine learning and artificial neural network
  • 121. Logistic regression  Elements of regression/classification problem  Data (observations and their target values )  Prediction model ( ) with parameters ( )  Cost(loss)/Objective function to minimize/maximize  Algorithm to efficiently obtain the optimal or a good solution 2019-09-26 121Machine learning and artificial neural network Data: 𝒊 𝟏 𝑵 Model with parameters: 𝑻 ( 𝜽 𝑻 𝒙 ) Cost/loss: Algorithm to min/maximize: Gradient descent
  • 122. Logistic regression: Optimization  Optimization  The cost contains the non-linear function σ(·), so maximizing it isn’t a simple QP problem.  We resort to gradient search to get the optimal (or a good) solution.  To perform gradient search, we need the gradient of the cost, which is given by (see textbook p.68) $\nabla_{\boldsymbol{\theta}}\,\ell = \sum_i \big(y_i - \sigma(\boldsymbol{\theta}^T\boldsymbol{x}_i)\big)\,\boldsymbol{x}_i$  Algorithm (pseudo code)  Initialize 𝜽⁽⁰⁾  𝜽⁽ⁿ⁺¹⁾ = 𝜽⁽ⁿ⁾ + η ∇𝜽ℓ(𝜽⁽ⁿ⁾) for n = 0, 1, 2, … 2019-09-26 122Machine learning and artificial neural network “+” means hill-climbing
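A NumPy sketch of the hill-climbing update above, written as a full-batch gradient step (an assumption for brevity; a per-sample variant would follow the SGD pattern of Appendix C). X is assumed to have rows as samples with a leading 1 for the intercept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, eta=0.1, epochs=200):
    """Maximize the log-likelihood of a logistic model by gradient ascent (hill climbing)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ theta)           # predicted probability of class 1 for each sample
        grad = X.T @ (y - p)             # gradient of the log-likelihood
        theta += eta * grad / len(y)     # "+" = ascent, as noted on the slide
    return theta
```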
  • 123. Logistic regression: Another cost function  Another cost: Residual sum of square (RSS)  𝑻 can also be interpreted as class estimate.  Define estimation error:  RSS as a cost to minimize  Gradient (see textbook p.68) 𝜽  Gradient descent ( ) ( ) ( ) for .  What’s difference from likelihood based optimization? 2019-09-26 123Machine learning and artificial neural network “-” means gradient descent
  • 124. Performance measures of binary classifier  Confusion matrix     2019-09-26 124Machine learning and artificial neural network  Why do we need other measures than accuracy?  In some application, FN (FP) causes more serious problem than FP (FN)  E.g., in medical application, you want to make decision if a person has tumor (P) or not (N). It isn’t a big problem if a normal person (without tumor) is decided to have tumor (FP). But, the opposite case (a person with tumor decided as normal, FN) may cause serious problem.  You may want to minimize FPR requiring TPR no less than a certain threshold.
  • 125. Performance measures of binary classifier  ROC and AUC  ROC: Receiver operating characteristic  AUC: Area under (the ROC) curve 2019-09-26 125Machine learning and artificial neural network [ROC figure: FPR = FP/(FP+TN) on the x-axis, TPR = TP/(TP+FN) on the y-axis; AUC is the area under the curve; the curve shows how performance changes as the decision boundary moves — shifting it one way, TP and FP go down while TN and FN go up, and vice versa]
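For reference, scikit-learn computes the ROC curve and AUC directly from the target labels and the decision variable; the scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])  # decision variable

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # sweep the decision boundary
print(roc_auc_score(y_true, y_score))               # area under the ROC curve
```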
  • 126. Homework & Computer Lab.  Homework: 7.1, 7.2  Practice: ML_practice2_classification_ex_190820.ipynb 2019-09-26 126Machine learning and artificial neural network
  • 127. Machine Learning and Neural Network Ch.8: Multi-task regression and multinomial classification Seokhyun Yoon, Electronics Eng., Dankook University
  • 128. Roadmap 2019-09-26 128Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 129. Ch.8 Multiclass classification  Topics 1. Multi-task regression 2. multinomial classification 3. Generalized linear model 2019-09-26 129Machine learning and artificial neural network
  • 130. Multi-task linear regression  Linear regression with vector target  Data: where , where : KxN matrix with each column being  Linear model: 𝑻 where : Kx(M+1) matrix (including intercept)  Define  ( ): th row of . ( : th column of )  : th column of  Cost function (RSS) 𝑻 ( ) ( ) 2019-09-26 130Machine learning and artificial neural network
  • 131. Multi-task linear regression  Linear regression with vector target  Cost function is a sum of RSS for each target value ( ) ( )  Optimization can be performed separately for each target value, i.e., 𝚯 𝜽 ( )  where 𝜽 ( ) gives 𝑿 𝑿 𝑿 𝒚( )  And 𝚯 gives 𝑿 𝑿 𝑿 𝒀  Can be implemented using K parallel linear regressors with scalar target value 2019-09-26 131Machine learning and artificial neural network
  • 132. Multi-task linear regression  Linear regression with vector target  Can be implemented using K parallel linear regressors with scalar target value  Alternative expression of cost function 𝑻 𝑻 2019-09-26 132Machine learning and artificial neural network
  • 133. Multinomial classification: two approaches  Multinomial classification can be implemented using multiple binary classifiers.  Two approaches (K-class case)  One against the rest:  we use K binary classifiers, one for each class.  Each classifier (the kth classifier) computes, for example, the likelihood of input x belonging to the kth class.  Decide the class having the highest likelihood  Pairwise binary classification + majority voting:  we use K(K−1)/2 binary classifiers, one for each pair of classes.  Decide the class by taking the majority of the winners. 2019-09-26 133Machine learning and artificial neural network
  • 134. Multinomial logistic regression  Data  Data: where ,  where is a set of categories (classes)  ’s are categorical and discrete  Considerations  (single-task) logistic regressor using (integer) as target value will not work well (because ’s are categorical, while single-task regressor regards ’s as numerical.)  One approach is to encode ’s to a binary vector (of size Kx1) and use multi-task logistic regressor 2019-09-26 134Machine learning and artificial neural network
  • 135. Multinomial logistic regression  Model  Softmax function on top of multi-task linear regressor  Multi-task linear regressor for (odds of belonging to class ) Or, collectively,  softmax function ∑ (likelihood of belonging to class )  Note that and 2019-09-26 135Machine learning and artificial neural network
  • 136. Multinomial logistic regression  Cost/objective  can be interpreted as Pr{ belongs to class }  Log-likelihood can be used as the objective to maximize.  Gradient: 𝜽 where  Gradient search: ( ) ( ) 𝜣 𝜣 𝜣( ) for . 2019-09-26 136Machine learning and artificial neural network Since 0 ≤ 𝑆 (𝜣 𝒙 ) ≤ 1, the direction of gradient is either 𝒙 for 𝑘 = 𝑦 or −𝒙 for 𝑘 ≠ 𝑦
  • 137. Multinomial logistic regression: more issues  One hot encoding  One hot encoding is a mapping of an integer to a binary vector .., such that , i.e., only one element of is 1 and all others are 0.  Example: ,  By encoding all the target values , , .., , , we have  is a KxN matrix with each column being  Then, the gradient is given by 𝜽 , 2019-09-26 137Machine learning and artificial neural network
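A NumPy sketch of one-hot encoding and of one gradient-ascent step of multinomial logistic regression, following the convention above that samples are columns; function and variable names are illustrative, not from the textbook.

```python
import numpy as np

def one_hot(y, K):
    """Map integer labels 0..K-1 to a K x N matrix of one-hot columns."""
    Y = np.zeros((K, len(y)))
    Y[y, np.arange(len(y))] = 1.0
    return Y

def softmax(O):
    E = np.exp(O - O.max(axis=0, keepdims=True))   # subtract the max for numerical stability
    return E / E.sum(axis=0, keepdims=True)

def softmax_step(Theta, X, Y, eta=0.1):
    """One gradient-ascent step on the log-likelihood (X: (M+1) x N, samples as columns)."""
    P = softmax(Theta @ X)                         # K x N class probabilities
    Theta += eta * (Y - P) @ X.T / X.shape[1]      # gradient of the log-likelihood
    return Theta
```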
  • 138. Multinomial logistic regression : more issues  Cross-entropy  With one hot encoding: , , .., ,  is the probability mass of  Posterior likelihood of : , , , with ,  The cross entropy between and is given by ,  We call “cross-entropy cost”. 2019-09-26 138Machine learning and artificial neural network
  • 139. Multinomial logistic regression: more issues  Multi-task logistic regressor  Using one-hot encoding, one can replace (for simplicity) the softmax function with K separate logistic sigmoid functions  K parallel logistic regressors.  Performance? 2019-09-26 139Machine learning and artificial neural network [Diagram: inputs x0..xM feeding K parallel units o1..oK with sigmoid outputs p̂1..p̂K]  Other remarks  Multinomial logistic regression is a one-against-the-rest approach.  Once the likelihoods p̂k are obtained, the class estimate is determined by taking the largest one
  • 140. Multinomial logistic regression: generalization  Generalized linear model  Linear regression and logistic regression can be represented by one structure  Consisting of an “activation function” on top of a multi-task linear regressor  The output can be interpreted in various ways (e.g., as likelihoods or as estimates of the target value) 2019-09-26 140Machine learning and artificial neural network  Also, there are many options for the activation function (e.g., linear, sigmoid or tanh)  If the input is categorical, apply one-hot encoding before feeding it to the regressor (the input dimension must be changed accordingly)
  • 141. Multinomial logistic regression: generalization  Generalized linear model  Regularization can also be applied, if desired, by defining the cost with a penalty term on the weight matrix, for both linear regression and logistic regression  The model basically regards the input and output as numerical. So, if you deal with categorical values, you need to apply one-hot encoding first. 2019-09-26 141Machine learning and artificial neural network
  • 142. Homework & Computer Lab.  Practice: ML_practice2_classification_ex_190820.ipynb 2019-09-26 142Machine learning and artificial neural network
  • 143. Machine Learning and Neural Network Ch.9: Artificial neural network Seokhyun Yoon, Electronics Eng., Dankook University
  • 144. Ch.9 Artificial neural network  Topics 1. Perceptron and artificial neural network (NN) 2. Neural network model 3. Training NN: backpropagation 4. Some issues on NN  Convergence to local minima  Overfitting  Vanishing gradient problem 5. Practical considerations (building and training NN) 2019-09-26 144Machine learning and artificial neural network
  • 145. Roadmap 2019-09-26 145Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 146. ANN: Perceptron  Perceptron  It is an array of neurons interconnected, exactly the same as in generalized linear models  It was suggested mimicking biological neuron 2019-09-26 146Machine learning and artificial neural network Biological neuron source: https://en.wikipedia.org/wiki/Biological_neuron_model Regression model (Artificial neuron) f(net) Neuron Input nodes (dendrites) Output nodes (axon terminal) Activation function (synaptic) Weights x0 x1 x2 xM 0 1 2 M net T x y
  • 147. ANN: Perceptron  Perceptron  The multi-task regression model is a horizontal array of artificial neurons, with either combined activation or separate activation 2019-09-26 147Machine learning and artificial neural network [Diagrams: inputs x0..xM feeding units o1..oK, either with a combined activation f(o1, o2,…, oK) producing ŷ1..ŷK, or with separate activations s(o1)..s(oK) producing p̂1..p̂K]
  • 148. ANN: Multi-layer Perceptron  Multi-layer Perceptron  Consists of multiple layers of multi-task regressors, vertically stacked  The output of one layer is fed to the input of the next layer.  The number of layers and the number of neurons per layer can be arbitrarily set  Non-linear activation functions make it different from the single-layer (linear) model, i.e., they make the model non-linear  Can be used for regression and classification 2019-09-26 148Machine learning and artificial neural network
  • 149. ANN: Multi-layer Perceptron  Operations  Feedforward (prediction phase): For a given input and the current parameters, it produces an output  Feedback (training phase): For each input and target vector, the parameters are updated  Gradient search is used for some optimality criterion 2019-09-26 149Machine learning and artificial neural network
  • 150. ANN: Multi-layer Perceptron  Structure definition  Number of layers:  Number of neurons per layer:  Full connection assumed  Signals and parameters  Input:  Target vector:  Weight matrix: ( )  Hidden layer output: ( )  Final output: ( ) 2019-09-26 150Machine learning and artificial neural network
  • 151. ANN: Multi-layer Perceptron  Feedforward (prediction) From to 1) ( ) ( ) ( ) ( ( ) ) 2) ( ) ( ) ( ( ) )  More simply, ( ) ( ) ( )  ( ) is 1-augmented version of ( )  ( ) is matrix including “intercept”  activation function is applied to each element of ( ) 2019-09-26 151Machine learning and artificial neural network
  • 152. ANN: Multi-layer Perceptron  Feedback (training)  Assume training is performed in per-sample basis, i.e., SGD  Cost function (RSS): ( , ,…, ) ( ) ( ) ( )  Cross-entropy can also be used as cost (not covered here)  To train the model, we need 𝑾( ) for  Top layer is easy: 𝑾( ) ( ) ( ) ( ) ( ) , where ( ) ( ) ( ) and ( ) ( ) ( )  Layer below ? We need to apply chain rule  The problem, however, is not as simple as you expect. See textbook, section 9.3 2019-09-26 152Machine learning and artificial neural network
  • 153. ANN: Multi-layer Perceptron  Feedback (training)  The training starts from top layer and run through downward, one by one.  Training: From to : ( ) ( ) ( ) with ( ) 𝑾( ) ( , ,…, ) where, by applying chain rule (see textbook p.81-82) ( ) ( ) ( ) ( ) ( ) ( ) ( )  We call it “backpropagation (BP)” as it is performed backward (downward), opposite to feedforward operation. 2019-09-26 153Machine learning and artificial neural network
  • 154. ANN: Multi-layer Perceptron  Back-propagation (BP) algorithm  From to : ( ) ( ) ( ) where ∆𝑤 ( ) = δ ( ) z ( ) δ ( ) = 𝑓′ 𝑎 ( ) 𝑦 − z ( ) δ ( ) = 𝑓′ 𝑎 ∑ 𝑤 δ 2019-09-26 154Machine learning and artificial neural network Vector-matrix form ∆𝑾( ) = 𝛅 𝐳 𝛅 = 𝑓′ 𝒂( ) ⨀ 𝒚 − 𝒛( ) 𝛅 = 𝑓′ 𝒂( ) ⨀ 𝑾( ) 𝛅( ) ⨀: element-wise product
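A per-sample NumPy sketch of the BP step above for a 2-layer MLP with sigmoid activations and RSS cost; shapes are W1: H×(M+1), W2: K×(H+1). This is a sketch under those assumptions, not the textbook's code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, y, W1, W2, eta=0.1):
    """One per-sample backprop step for a 2-layer MLP (sigmoid activations, RSS cost)."""
    # feedforward
    z0 = np.append(1.0, x)                    # 1-augmented input
    a1 = W1 @ z0
    z1 = np.append(1.0, sigmoid(a1))          # 1-augmented hidden output
    a2 = W2 @ z1
    z2 = sigmoid(a2)                          # network output
    # backpropagation: delta of the top layer, then the layer below
    d2 = sigmoid(a2) * (1 - sigmoid(a2)) * (y - z2)
    d1 = sigmoid(a1) * (1 - sigmoid(a1)) * (W2[:, 1:].T @ d2)   # drop the bias column
    # weight updates Delta W = delta z^T, applied with learning rate eta
    W2 += eta * np.outer(d2, z1)
    W1 += eta * np.outer(d1, z0)
    return W1, W2
```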
  • 155. ANN: Multi-layer Perceptron  Activation function  Except for the top (output) layer, activation function should be non-linear for a hidden layer to be effective.  Any monotonically increasing function can be used.  They are typically s-shaped, e.g., logistic sigmoid or tanh  ReLU or leaky ReLU are widely used recently. ReLU: Leaky ReLU: with 2019-09-26 155Machine learning and artificial neural network
  • 156. ANN Issues: Convergence to local minima  Convergence to local minima  An NN is a non-linear model and the cost J is not convex.  The number of minima/maxima is not known  Gradient search does not guarantee convergence to the global minimum  The local minimum we get depends on the initial setting of W  There is no systematic approach to achieve the global minimum yet.  Simulated annealing and genetic algorithms were proposed as heuristic solutions 2019-09-26 156Machine learning and artificial neural network
  • 157. ANN Issues: Overfitting  Overfitting  An NN model has many parameters (W(1),W(2),…,W(L))  Deep NNs especially so  Similar to the linear model with N << M, an NN with too many parameters may easily be overfitted to the training data  Three approaches to relieve overfitting  Noise injection: increasing the number of data by adding noise  reduces the generalization error (to some extent)  Regularization technique: add an L1/L2 penalty to the cost function  similar impact to noise injection  Dropout ? 2019-09-26 157Machine learning and artificial neural network
  • 158. ANN Issues: Overfitting  Dropout: avoiding co-adaptation of neurons  Useful for Convolutional NN (for image)  At each training phase (for a batch of samples), we randomly select a portion of neurons (with probability p) and disable them  Can avoid many neurons co-adapted to each other (avoid many neurons activated to similar data)  Many NN packages support dropout layer as an option 2019-09-26 158Machine learning and artificial neural network
  • 159. ANN Issues: Vanishing gradient  Vanishing gradient problem  This is also a typical problem in deep neural networks.  BP (training) starts from the top layer and runs downward one-by-one, recursively.  Recall that the deltas δ(l) contain the factor f′(a); with a sigmoid function f′(a) ≤ 1/4 (it’s mostly close to 0)  The δ’s are computed recursively  As BP runs downward, δ gets smaller and smaller, and so does ΔW(l)  vanishing gradient  If the NN has many layers, the effective learning rate in the bottom layers gets very small, i.e., neurons in the bottom layers are hardly trained  it takes too much time to train them 2019-09-26 159Machine learning and artificial neural network
  • 160. ANN Issues: Vanishing gradient  Vanishing gradient problem  Using ReLU or leaky ReLU may help alleviate vanishing gradient problem.  Unsupervised learning based pre-training of bottom layers was proposed, though not so widely used recently. 2019-09-26 160Machine learning and artificial neural network
  • 161. ANN Issues: Building NN model  To build a neural network model, you need to consider first  Input and output dimension?  How many layers? ( )  How many neurons for each layer? ( )  Activation function ? (sigmoid, tanh, ReLU or leaky ReLU)  Dropout layer? With what probability? (p)  What cost function ? (RSS or cross-entropy)  Which optimizer to use? (simple SGD w/wo momentum .. )  Batch size?  Regression or classification ? (For regression, top layer activation is typically set linear) 2019-09-26 161Machine learning and artificial neural network
  • 162. ANN Issues: Training NN model  When training NN, you need to check  Overfitting (compare performance with training and test data while training the model)  Vanishing gradient (check if training takes too much time)  Convergence to bad local minima (you can train many times or train multiple instances in parallel with different initial values) 2019-09-26 162Machine learning and artificial neural network Computer Lab.  Practice: ML_practice3_NN_ex.ipynb
  • 163. Machine Learning and Neural Network Ch.10: Recurrent neural network (RNN) Seokhyun Yoon, Electronics Eng., Dankook University
  • 164. Ch.10 Recurrent neural network  Topics 1. Model structure and operation. 2. RNN Training: backpropagation through time (BPTT) 3. LSTM (long/short term memory) 2019-09-26 164Machine learning and artificial neural network
  • 165. Roadmap 2019-09-26 165Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 166. RNN: Recurrent neural network  Features  Recurrence means the output is fed back to the input  Necessarily, the input is time-series data  The example on the right consists of two layers  The hidden layer output is fed back to the input with a one-sample delay (D) 2019-09-26 166Machine learning and artificial neural network  Layer 2 has no feedback loop (conventional NN layer)  Main applications are speech recognition and language modelling (machine translation, sentence completion), where data is given as a time series  𝒉(t) = 𝑓(𝑼𝒙(t) + 𝑽𝒉(t−1)),  𝒚(t) = 𝑓(𝑾𝒉(t))
  • 167. RNN: Recurrent neural network  Model  Consider a 1-layer RNN for simplicity  Input: 𝒙(t) (time series)  Output (state): 𝒉(t) (time series)  Feedforward operation: 𝒉(t) = 𝑓(𝑼𝒙(t) + 𝑽𝒉(t−1))  The output depends on both 𝒙(t) and the previous output (state) 𝒉(t−1)  The feedforward operation can also be expressed as 𝒉(t) = 𝑓(𝒈(t)) with 𝒈(t) = 𝑼𝒙(t) + 𝑽𝒉(t−1)  Initial condition: assume 𝒉(0) = 𝟎 2019-09-26 167Machine learning and artificial neural network [Figure (a): RNN with a loop — f(·), h(t), x(t), h(t−1), U, V, D, g(t)]
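A minimal NumPy sketch of the feedforward recursion h(t) = f(U x(t) + V h(t−1)), using tanh as the activation (an assumption; other activation functions can be used as well).

```python
import numpy as np

def rnn_forward(X, U, V, h0=None):
    """Run a single-layer vanilla RNN over a sequence: h(t) = tanh(U x(t) + V h(t-1))."""
    H = V.shape[0]
    h = np.zeros(H) if h0 is None else h0
    outputs = []
    for x_t in X:                      # X: sequence of input vectors x(1), x(2), ...
        h = np.tanh(U @ x_t + V @ h)
        outputs.append(h)
    return np.array(outputs)           # stacked states h(1), ..., h(T)
```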
  • 168. RNN: Recurrent neural network  Unfolded model 2019-09-26 168Machine learning and artificial neural network f(·) h(t) x(t) h(t-1) U V (a) RNN with a loop D g(t)
  • 169. RNN: Training  RNN Training (textbook 10.2)  Cost function: ( ) ( ) ( ) ( ( ): target vector)  Gradient can be obtained by applying chain rule.  Gradient w.r.t. (at time ) 𝑽 ( ) 𝑽 where ( ) 𝑽 ( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝑽 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) ( ) 𝒈( ) 𝒈( ) 𝑽  With 𝒈( ) 𝒈( ) ( ) and ( ) 𝒈( ) 𝒈( ) 𝑽 ( ) ( ) , ( ) 𝑽 ( ) ( ) ( ) 2019-09-26 169Machine learning and artificial neural network
  • 170. RNN: Training  RNN Training (textbook 10.2)  Gradient of (at time ) 𝑼 ( ) 𝑼 where ( ) 𝑼 ( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝒈( ) 𝑼  In the same way as for the gradient w.r.t. , we have ( ) 𝑼 ( ) ( ) ( )  To update and , we need perform BP through time (from to ).  We call it backpropagation through time (BPTT). 2019-09-26 170Machine learning and artificial neural network
  • 171. RNN: Training  Vanishing and exploding gradient  Looking at 𝑼 (and also 𝑽 ), the gradient contains ( )  where for any activation function we considered, ( ) (matrix norm)  Assuming , we have ( ) (mostly , why?)  As , l.h.s. goes to 0 if (vanishing gradient) or to  if ( ) /( ) (exploding gradient)  The latter seldom occurs. 2019-09-26 171Machine learning and artificial neural network
  • 172. RNN: Training  Forgets past inputs/outputs quickly  We also have for 𝑼 ( ) ( ) ( ) ( ) ( )  RNN is supposed to memorize the past inputs (in the system state) to deal with time-series data.  With , however, as gets large.  This means the system forgets past inputs quickly.  There are many examples where we need long term memory to correctly catch what exactly the sentence means. 2019-09-26 172Machine learning and artificial neural network
  • 173. RNN: Training  RNN summary  Due to its recurrent nature, RNN training requires backpropagation through time (back to t=1)  If T gets large, the gradient may vanish or explode.  the training rule should be carefully tuned.  In most cases, vanishing gradient occurs more frequently than exploding gradient  One solution to avoid the vanishing/exploding gradient problem is to perform BPTT only over a finite-length time window (an unfolded model of finite length) 2019-09-26 173Machine learning and artificial neural network
  • 174. RNN: LSTM  Long/Short term memory (LSTM)  a variant of RNN (proposed in 1997) to solve (partly) the vanishing gradient and to make system memory longer.  Vanilla RNN and LSTM  3 gates (forgetting/input/output gate) + main path  Two separate states: ( ) and (𝒕) 2019-09-26 174Machine learning and artificial neural network
  • 175. RNN: LSTM  LSTM operation  Gating function: ( ) ( ) ( ) (forget gate) ( ) ( ) ( ) (input gate) ( ) ( ) ( ) (output gate)  Cell state update: (𝒕) ( ) ( ) ( ) ( ) ( ) (long-term memory) ( ) ( ) ( ) (short-term memory) 2019-09-26 175Machine learning and artificial neural network
  • 176. RNN: LSTM  LSTM operation  Cell state update: (𝒕) ( ) ( ) ( ) ( ) ( ) (long-term memory) ( ) ( ) ( ) (short-term memory, final output)  When ignoring the gating function, ( ) is simply a sum of ( ) and the new input ( ) ( )  can keep long-term memory  ( ) select important features from previous state ( ), which comprise a part of the current output ( ) .  ( ) select important features from the new input (output of vanilla RNN), which comprise another part of the current output)  ( ) controls what features in ( ) to pass to output ( ) . 2019-09-26 176Machine learning and artificial neural network
  • 177. RNN: LSTM  LSTM operation  Gating function: ( ) ( ) ( ) (forget gate) ( ) ( ) ( ) (input gate) ( ) ( ) ( ) (output gate)  Parameters of three gates ( , , , , , ) are obtained through BPTT, too.  i.e., LSTM learns from the data what features to select from ( ) (long-term memory) and from ( ) ( ) .  Also it learns what features in ( ) to pass to the final output ( ) . 2019-09-26 177Machine learning and artificial neural network
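A NumPy sketch of one LSTM cell update following the standard formulation described above; biases are omitted and the parameter names Wf, Uf, … are illustrative, so the textbook's notation may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc):
    """One LSTM cell update: gates, long-term cell state, short-term output."""
    f = sigmoid(Wf @ x + Uf @ h_prev)     # forget gate
    i = sigmoid(Wi @ x + Ui @ h_prev)     # input gate
    o = sigmoid(Wo @ x + Uo @ h_prev)     # output gate
    g = np.tanh(Wc @ x + Uc @ h_prev)     # candidate features (vanilla-RNN part)
    c = f * c_prev + i * g                # long-term memory (cell state)
    h = o * np.tanh(c)                    # short-term memory (final output)
    return h, c
```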
  • 178. RNN: Building RNN/LSTM model  Unfolded RNN/LSTM model  You can add NN layer on top of RNN/LSTM cell. 2019-09-26 178Machine learning and artificial neural network RNN/LSTM cell K RNN/LSTM cell K-1 RNN/LSTM cell 2 RNN/LSTM cell 1 y(t) y(t-1) y(t-K+1) y(t-K) x(t) x(t-1) x(t-K+1) x(t-K) DD D Computer Lab.  Practice 1: ML_practice4_RNN_seq_pred.ipynb  Practice 2: ML_practice5_RNN_hihello.ipynb
  • 179. Machine Learning and Neural Network Ch.11: Convolutional neural network (CNN) Seokhyun Yoon, Electronics Eng., Dankook University
  • 180. Ch.11 Convolutional neural network  Topics 1. Features of CNN 2. CNN Model  Convolution sublayer  Activation function sublayer  Pooling sublayer 3. CNN Training 2019-09-26 180Machine learning and artificial neural network
  • 181. Roadmap 2019-09-26 181Machine learning and artificial neural network Ch.4 Linear Regression Ch.6 Ridge/Lasso regression Ch.7 Logistic regression Ch.8 Multi-task regression Ch.9 Neural Network Ch.10 Recurrent NN Ch.11 Convolutional NN D D D x(t) ŷ(t) h(t) h(t-1) Layer 2 Layer 1
  • 182. CNN: Convolutional neural network  Image/Vision classification and object detection  An image has 2D(matrix) or 3D(tensor) structure (i.e., RGB)  Information is contained in a pixel, an element of a matrix (2D image) or a tensor (2D images for RGB or 2D images captured with 2 cameras).  Nearby pixels (values) are highly correlated  patterns in an image can be identified by the correlations between nearby pixels  nearby pixels must be processed as a chunk  Identifying patterns in an image is “translation invariant” and “size invariant”. (we can identify same patterns wherever it is located and whatever its size is).  Sometimes, it should also be rotation invariant. 2019-09-26 182Machine learning and artificial neural network
  • 183. CNN: Convolutional neural network  CNN for image/vision data  A CNN is a special NN designed for image/vision data.  Can be used for image classification, object detection, depth estimation, etc.  It processes a chunk of nearby pixels simultaneously (the receptive field).  We will see how it provides object (pattern) detection with translation invariance.  Size invariance can be provided by a multi-layer structure.  Rotation invariance? 2019-09-26 183Machine learning and artificial neural network
  • 184. CNN Model  (Example) configuration of a CNN  Two convolution NN layers and 3 fully connected (FC) NN layers.  Convolution NN layers are divided into sublayers: a convolution sublayer (denoted by CX) and a pooling sublayer (denoted by SX).  The (FC) NN layers are C5, F6 and the output layer (C5 behaves like an FC NN layer). 2019-09-26 184Machine learning and artificial neural network source: Proc. of IEEE, Nov. 1998 by Y. LeCun, et al.
  • 185. CNN Model – Convolution layer  CNN model (convolution layer)  For convenience, we divide it into 3 sublayers:  Convolution sublayer  Activation function sublayer  Pooling sublayer  The activation function sublayer is the same as in a conventional NN.  Dropout can also be applied, as in a fully connected NN layer. 2019-09-26 185Machine learning and artificial neural network
  • 186. CNN Model – Convolution layer  CNN model (convolution layer)  A conventional NN layer has a 1-dimensional array of neurons, while a CNN layer has a 3-dimensional array (width, height and depth), where the depth index is called the “channel”.  The input to a CNN layer is also 3-dimensional, e.g., 2D images with RGB (3 channels).  Denote the 3-D input and output of the l-th CNN layer as x_i^(l) and y_j^(l), where i and j are channel indices. 2019-09-26 186Machine learning and artificial neural network
  • 187. CNN Model – Convolution layer  CNN model (convolution layer)  The operations of the three sublayers are  Convolution sublayer ------------: u_j^(l) = Σ_i W_{j,i}^(l) ∗ x_i^(l)  Activation function sublayer ---: v_j^(l) = f(u_j^(l))  Pooling sublayer -----------------: y_j^(l) = pool(v_j^(l))  The input and output sizes are the same only for the AF sublayer; the other two sublayers have different input and output sizes. 2019-09-26 187Machine learning and artificial neural network
  • 188. CNN Model – Convolution layer  Convolution sublayer  u_j^(l) = Σ_i W_{j,i}^(l) ∗ x_i^(l)  W_{j,i}^(l) is the weight matrix (filter) between the i-th channel of the input and the j-th channel of the output.  ∗ is 2-D convolution, with which the (m,n)-th element of u_j^(l) is given by u_{m,n}^(l,j) = Σ_i Σ_{(p,q)∈R(m,n)} w_{p,q}^(l,j,i) x_{p,q}^(l,i)  R(m,n) is the “receptive field” of the (m,n)-th neuron 2019-09-26 188Machine learning and artificial neural network [Figure: 2-D array of neurons of the j-th output channel and 2-D array of input signal of the i-th input channel]
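The per-element sum above can be written directly in NumPy. The following is only a minimal sketch (stride 1, no zero-padding, no bias term), with assumed array shapes, not an efficient or complete implementation.

```python
import numpy as np

def conv_sublayer(x, W):
    """x: (C_in, H, W_in) input channels; W: (C_out, C_in, K, K) filters."""
    C_in, H, Wd = x.shape
    C_out, _, K, _ = W.shape
    Ho, Wo = H - K + 1, Wd - K + 1
    u = np.zeros((C_out, Ho, Wo))
    for j in range(C_out):
        for m in range(Ho):
            for n in range(Wo):
                # receptive field R(m, n): a K x K patch of every input channel
                u[j, m, n] = np.sum(W[j] * x[:, m:m + K, n:n + K])
    return u

# Toy usage: 2 input channels, 3 output channels, 5x5 filters
rng = np.random.default_rng(0)
u = conv_sublayer(rng.standard_normal((2, 12, 12)), rng.standard_normal((3, 2, 5, 5)))
print(u.shape)  # (3, 8, 8)
```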
  • 189. CNN Model – Convolution layer  Convolution sublayer  Each filter responds to a certain pattern within a receptive field on the input.  Filter examples: three filters of size 5x5 respond to different patterns (diamond, T and diagonal, respectively).  The filter coefficients are obtained through CNN training and, in general, they are real values. 2019-09-26 189Machine learning and artificial neural network
  • 190. CNN Model – Convolution layer  Convolution sublayer example (2 input ch., 3 output ch.)  All the neurons of a channel share the same weight matrices.  A channel (2D array) is a feature map containing information on a (combination of) specific pattern(s) defined by its weight matrices (information on location and existence). 2019-09-26 190Machine learning and artificial neural network
  • 191. CNN Model – Convolution layer  Convolution sublayer  Configuration parameters • s: stride, K×K: size of the (2-D) weight matrix • size of the input and output (3-D): width × height × number of channels  The stride, filter size and input/output sizes must be set consistently.  The number of weight matrices (filters) to train is (number of input channels) × (number of output channels).  In general, the output width/height are no larger than those of the input, while the number of output channels may differ from the number of input channels. 2019-09-26 191Machine learning and artificial neural network [Figure: 2-D array of neurons of the j-th output channel and 2-D array of input signal of the i-th input channel]
  • 192. CNN Model – Convolution layer  Activation function sublayer  v_j^(l) = f(u_j^(l))  The output of the convolution sublayer, u_j^(l), is then passed through an activation function.  ReLU or leaky ReLU is typically used.  The output v_j^(l) has the same size as the input. 2019-09-26 192Machine learning and artificial neural network
  • 193. CNN Model – Convolution layer  Pooling sublayer  y_j^(l) = pool(v_j^(l))  The pooling sublayer down-samples the sublayer input, v_j^(l).  While doing so, it also summarizes the data.  Let r be the down-sampling ratio. Each channel of the input is partitioned into r×r areas (pooling areas), in each of which an r×r array of numbers is summarized into a scalar.  Two types: max-pooling (takes the maximum value) and average-pooling (takes the average of the values).  The output size is 1/r² of the input size. 2019-09-26 193Machine learning and artificial neural network
  • 194. CNN Model – Convolution layer  Pooling sublayer  The pooling operation can be expressed as y_{m,n}^(l,j) = max_{(p,q)∈P(m,n)} v_{p,q}^(l,j) (max-pooling) or y_{m,n}^(l,j) = (1/|P(m,n)|) Σ_{(p,q)∈P(m,n)} v_{p,q}^(l,j) (average-pooling)  P(m,n) is the pooling area of the (m,n)-th output.  Pooling reduces the computational burden, e.g., with r = 2, the amount of data passed to the next layer (and hence the number of parameters to train there) is reduced to ¼.  If r is too large, however, important information can be lost.  It is better to apply pooling multiple times with a small r. 2019-09-26 194Machine learning and artificial neural network
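A minimal NumPy sketch of the pooling operation above, assuming the channel height and width are multiples of the down-sampling ratio r:

```python
import numpy as np

def pool_sublayer(v, r=2, mode="max"):
    """v: (C, H, W) activations; returns (C, H//r, W//r)."""
    C, H, W = v.shape
    blocks = v.reshape(C, H // r, r, W // r, r)  # split each channel into r x r pooling areas
    if mode == "max":
        return blocks.max(axis=(2, 4))           # max-pooling
    return blocks.mean(axis=(2, 4))              # average-pooling

# Toy usage
y = pool_sublayer(np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8), r=2)
print(y.shape)  # (3, 4, 4)
```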
  • 195. CNN training  CNN training  The parameters to optimize are the weight matrices, W_{j,i}^(l)'s, for all layers and channel pairs.  Similar to a conventional NN, we apply the chain rule to compute the gradient w.r.t. W_{j,i}^(l).  Differences from a conventional NN: 1. 3-D (cubic) arrays of neurons 2. partial connection & weight sharing in the conv. sublayer 3. passing the gradient through the pooling sublayer.  See textbook section 11.3 for details 2019-09-26 195Machine learning and artificial neural network
  • 196. CNN training  Improving the performance of a CNN  Apply dropout to avoid co-adaptation between channels.  Data normalization: adjust the mean (brightness) and variance (contrast) of each image to make them fall within predefined ranges.  Batch normalization: normalize the data for each batch at each layer.  Data augmentation: enlarge the data set by resizing and/or rotating the original images  size/rotation invariance. 2019-09-26 196Machine learning and artificial neural network
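Putting the pieces together, the following hedged tf.keras sketch shows where convolution, pooling, batch normalization and dropout typically sit in such a model; the input shape (28×28 grayscale) and all layer sizes are illustrative assumptions, not the configuration used in the course practice notebook.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),            # e.g., grayscale images
    tf.keras.layers.Conv2D(16, 5, activation="relu"),    # convolution + activation sublayers
    tf.keras.layers.BatchNormalization(),                 # batch normalization
    tf.keras.layers.MaxPooling2D(pool_size=2),            # pooling sublayer, r = 2
    tf.keras.layers.Conv2D(32, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),        # FC NN layer on top
    tf.keras.layers.Dropout(0.5),                         # dropout against co-adaptation
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```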
  • 197.  Practice: ML_practice6_CNN_190820.ipynb 2019-09-26 197Machine learning and artificial neural network Computer Lab.
  • 198. Machine Learning and Neural Network Ch.12/13: Unsupervised learning: Clustering and data visualization Seokhyun Yoon, Electronics Eng., Dankook University
  • 199. Ch.12/13 Clustering and data visualization  Topics 1. Clustering  Partitioning (centroid) based clustering: k-means algorithm  Hierarchical (connectivity based) clustering and dendrogram  Density based clustering  Distribution based clustering 2. EM algorithm for Gaussian Mixture Model (Ch.13) 3. Data visualization using non-linear mapping: t-SNE 2019-09-26 199Machine learning and artificial neural network
  • 200. Clustering and data visualization  Clustering  Data without labels: {x_i}_{i=1}^N, where x_i ∈ R^d  The objective is to divide the data into a set of groups based on some similarity measure.  Need to devise procedures to efficiently group the data.  Data (distribution) visualization to check the clusters.  Typical similarity measures:  Euclidean distance: ‖x_i − x_j‖  Correlation: x_i^T x_j / (‖x_i‖ ‖x_j‖) 2019-09-26 200Machine learning and artificial neural network
  • 201. Clustering and data visualization  Four approaches to clustering  Partitioning (centroid) based clustering: k-means  Hierarchical (connectivity based) clustering  Density based clustering  Distribution based clustering: Gaussian Mixture Model and EM algorithm (ch.13) 2019-09-26 201Machine learning and artificial neural network
  • 202. Partitioning (centroid) based Clustering: k-means  Partitioning (centroid) based clustering 2019-09-26 202Machine learning and artificial neural network  The feature space is partitioned into Voronoi regions, where each region is represented by a centroid.  Based on the Euclidean distance measure, the points in a Voronoi region are those closest to its centroid.  The k-means (Lloyd) algorithm searches for the centroids of a pre-defined number of regions to partition.
  • 203. Partitioning (centroid) based Clustering: k-means  K-means clustering (Lloyd's algorithm)  Input: {x_i}_{i=1}^N, K: the number of clusters to find  Initialization: randomly select K samples to use them as the centroids m_1, …, m_K 1) Determine class members S_k:  Set S_k = ∅ for all k  For all samples x_i, do  k* = argmin_{k∈{1,…,K}} ‖x_i − m_k‖², S_{k*} ← S_{k*} ∪ {x_i} 2) Update centroids: m_k = (1/|S_k|) Σ_{x∈S_k} x (mean of its members)  Repeat 1) and 2) many times until the m_k don't change any more  Output: m_k for all k and a cluster label for every x_i 2019-09-26 203Machine learning and artificial neural network
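A minimal NumPy sketch of Lloyd's algorithm as written above; the initialization and stopping rule are simplified for brevity.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: (N, d) data matrix; K: number of clusters. Returns centroids and labels."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), K, replace=False)]            # random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # N x K distances
        labels = d.argmin(axis=1)                           # 1) assign members
        new_m = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else m[k]
                          for k in range(K)])               # 2) update centroids
        if np.allclose(new_m, m):                           # stop when centroids settle
            break
        m = new_m
    return m, labels

# Toy usage
X = np.random.default_rng(1).standard_normal((200, 2))
centroids, labels = kmeans(X, K=3)
```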
  • 204. Partitioning (centroid) based Clustering: k-means  Partitioning (centroid) based clustering  The k-means algorithm was originally proposed for vector quantization.  The clusters found can be quite different from our expectation, especially when the sizes of the true clusters are quite different. 2019-09-26 204Machine learning and artificial neural network Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
  • 205. Hierarchical (connectivity based) clustering  Hierarchical clustering  The cluster hierarchy is represented by a dendrogram (a binary tree representing similarity between clusters).  In the tree, a node is a cluster and a leaf node is a sample.  Two approaches to build the dendrogram: top-down (divisive) or bottom-up (agglomerative) 2019-09-26 205Machine learning and artificial neural network [Figure: dendrogram with root node, internal nodes and leaf nodes; samples labelled by (BRCA) tumor category, features labelled by gene name]
  • 206. Hierarchical (connectivity based) clustering  Bottom-up (agglomerative) approach  Initially, each sample is set as a cluster (leaf node) having only one member. 1) Compute “inter-cluster distances” for every pair of clusters (nodes without a parent). 2) Select the pair with the smallest distance and merge them into one (add a node in the tree connecting the two nodes).  Repeat 1) and 2) until only one cluster is left 2019-09-26 206Machine learning and artificial neural network Source: https://www.researchgate.net/publication/273456906_Cluster_Analysis_to_Understand_Socio-Ecological_Systems_A_Guideline/figures?lo=1
  • 207. Hierarchical (connectivity based) clustering  Bottom-up (agglomerative) approach 2019-09-26 207Machine learning and artificial neural network  Inter-cluster distance: the distance between two clusters  It can be defined as the  minimum (single-linkage)  average (average-linkage)  maximum (complete-linkage)  of the distances between every pair of members (one from each cluster)
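In practice, agglomerative clustering is usually called from a library. A hedged usage sketch with SciPy, where the method argument selects single/average/complete linkage as defined above (the toy data and the cut into 3 clusters are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).standard_normal((30, 5))   # toy data, 30 samples
Z = linkage(X, method="average", metric="euclidean")    # build the merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")         # cut the tree into 3 clusters

dendrogram(Z)       # visualize the hierarchy
plt.show()
```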
  • 208. Hierarchical (connectivity based) clustering  Top-down (divisive) approach  Initially, we have only one cluster having all the samples as its members (root node). 1) Select the cluster having the highest “intra-cluster distance” (for example). 2) Apply k-means clustering to divide it into two.  Repeat 1) and 2) until every cluster has only one member.  Another name for this is “hierarchical k-means” 2019-09-26 208Machine learning and artificial neural network
  • 209. Density based clustering  Density based clustering  A cluster is defined as a set of samples that lie within a relatively dense area.  Clusters are divided by sparse areas.  Useful when clusters are not centralized (not radially distributed).  Two well-known algorithms: DBSCAN and OPTICS 2019-09-26 209Machine learning and artificial neural network Source: https://untitledtblog.tistory.com/146
  • 210. Density based clustering  Density based clustering: DBSCAN  Two parameters: ε (distance threshold) and minPts (# of points)  Definition (core point): a point from which there are at least minPts points within a distance ε.  First, divide all the points into core and non-core points.  Assign cluster # to core points 1) Select a core point x whose cluster is not assigned yet. 2) Find all the core points that can be connected within a distance ε to each other  assign a cluster # to these core point(s) 3) Repeat 1) and 2) to find all the core point clusters  Assign cluster # to non-core points 1) For each non-core point, find the closest core point within the distance ε and set its cluster to the cluster # of that core point. 2) If there is no core point within ε, it is simply regarded as an outlier. 2019-09-26 210Machine learning and artificial neural network
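A hedged usage sketch of DBSCAN with scikit-learn; eps and min_samples play the roles of the distance threshold and the minimum number of points above, and the values shown are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).standard_normal((200, 2))      # toy data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)      # label -1 marks outliers
print(sorted(set(labels)))
```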
  • 211. Distribution based clustering  Distribution based clustering: Mixture model  Use a PDF model (with parameters) to approximate the probability distribution of each cluster.  The data distribution is modelled by a mixture of the PDFs.  A well-known, mathematically tractable one is the Gaussian mixture model (GMM), for which the data distribution is modelled by p(x) = Σ_{k=1}^{K} π_k N(x; μ_k, C_k), where k is the cluster index and K is the number of clusters.  The objective is to find the optimal model parameters π_k, μ_k, C_k for k = 1, …, K that best fit the given data set. 2019-09-26 211Machine learning and artificial neural network
  • 212. Gaussian mixture model & EM algorithm  Gaussian mixture model  p(x | z = k) = N(x; μ_k, C_k) and p(z = k) = π_k, so that p(x) = Σ_{k=1}^{K} π_k N(x; μ_k, C_k), where k is the cluster index, K is the number of clusters and z is a latent variable (은닉 변수).  The objective is to find the optimal model parameters π_k, μ_k, C_k for k = 1, …, K that best fit the given data set.  Issues  We may use the likelihood as the objective function: L(θ) = Π_i p(x_i; θ) (θ consists of the {π_k, μ_k, C_k}'s)  It is not easy to maximize, as p(x) contains a summation over k due to the latent variable. 2019-09-26 212Machine learning and artificial neural network
  • 213. Gaussian mixture model & EM algorithm  EM algorithm (in general)  Use the conditional likelihood given z, i.e., assume z (the cluster of each sample x_i) is fixed.  Define the conditional (complete-data) log-likelihood log p(X, z; θ).  With this, we iteratively find the expected log-likelihood Q and the parameters θ.  Steps  Initialize θ^(0) and do the following while not converged 1) E-step: Q(θ | θ^(t)) = E_{z|X,θ^(t)}[ log p(X, z; θ) ] 2) M-step: θ^(t+1) = argmax_θ Q(θ | θ^(t)) 2019-09-26 213Machine learning and artificial neural network
  • 214. Gaussian mixture model & EM algorithm  EM algorithm for the Gaussian mixture model  Conditional likelihood: p(X, z; θ) = Π_i π_{z_i} N(x_i; μ_{z_i}, C_{z_i})  Steps  Input: {x_i}_{i=1}^N, K  Initialize θ^(0)  Do the following while not converged 1) E-step: γ_{ik}^(t) = π_k^(t) N(x_i; μ_k^(t), C_k^(t)) / Σ_j π_j^(t) N(x_i; μ_j^(t), C_j^(t)) 2) M-step: π_k^(t+1) = (1/N) Σ_i γ_{ik}^(t), μ_k^(t+1) = Σ_i γ_{ik}^(t) x_i / Σ_i γ_{ik}^(t), C_k^(t+1) = Σ_i γ_{ik}^(t) (x_i − μ_k^(t+1))(x_i − μ_k^(t+1))^T / Σ_i γ_{ik}^(t) (See textbook section 13.2 for details) 2019-09-26 214Machine learning and artificial neural network
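A minimal NumPy/SciPy sketch of the E- and M-steps above; initialization, convergence checking and numerical safeguards are simplified, and the variable name gamma (for the responsibilities γ_ik) is our own choice.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM to X (N x d) with plain EM iterations."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                             # mixing weights
    mu = X[rng.choice(N, K, replace=False)]              # means initialized from samples
    C = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k]
        gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], C[k])
                          for k in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update pi, mu, C
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            C[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, C, gamma

# Toy usage; hard cluster labels are gamma.argmax(axis=1)
X = np.random.default_rng(1).standard_normal((300, 2))
pi, mu, C, gamma = gmm_em(X, K=3)
```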
  • 215. Gaussian mixture model & EM algorithm  Clustering with GMM  Note  The number of clusters K must be fixed a priori.  Variational EM can find a good number for K implicitly.  See “C. M. Bishop, Pattern Recognition and Machine Learning, Springer” for variational EM 2019-09-26 215Machine learning and artificial neural network Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
  • 217. Non-linear feature dimension reduction: t-SNE  Data (distribution) visualization  Data visualization gives us a lot of information on the data: the shape of its distribution, the number of separable clusters, and so on.  One can also check whether clustering was done properly and whether there are any outliers.  Linear dimension reduction (PCA) is effective if the number of clusters or the original feature dimension is small enough.  We discuss a non-linear dimension reduction technique, t-distributed stochastic neighbor embedding (t-SNE). 2019-09-26 217Machine learning and artificial neural network
  • 218. Non-linear feature dimension reduction: t-SNE  Requirements in general  Points close to each other in the original space must also be close together in the new (low dimensional) space.  The local structure (manifolds) in the original space is kept in the new space with as little distortion as possible.  Characteristics of t-SNE  It is a non-linear mapping.  Direct mapping: x in the original space  z in the new space, obtained by solving an optimization problem.  If some new data is added, we need to perform the optimization again, and the new mapping will be different from the previous one.  An upgraded version of SNE 2019-09-26 218Machine learning and artificial neural network
  • 219. Non-linear feature dimension reduction: t-SNE  Elements  Pairwise similarity in the original space: p_ij  Pairwise similarity in the new space: q_ij  Cost function: C(Z)  Definition  Given data points {x_i}, let z_i be the point-wise mapping of x_i in the new space.  p_ij = exp(−‖x_i − x_j‖²/2σ²) / Σ_{k≠l} exp(−‖x_k − x_l‖²/2σ²)  q_ij = (1 + ‖z_i − z_j‖²)^(−1) / Σ_{k≠l} (1 + ‖z_k − z_l‖²)^(−1)  Both p_ij and q_ij are valid PMFs. 2019-09-26 219Machine learning and artificial neural network
  • 220. Non-linear feature dimension reduction: t-SNE  Cost function: Kullback-Leibler divergence (KLD)  Cost = KLD between p and q: C(Z) = KL(P‖Q) = Σ_{i≠j} p_ij log(p_ij / q_ij)  C(Z) ≥ 0 and C(Z) = 0 iff q_ij = p_ij hold if p_ij and q_ij are valid PMFs  Optimization  We want to find {z_i} that minimize C(Z).  Apply gradient descent, for which the gradient of C(Z) w.r.t. z_i is given by ∂C(Z)/∂z_i = 4 Σ_j (p_ij − q_ij)(z_i − z_j)(1 + ‖z_i − z_j‖²)^(−1)  More tricks were applied (see the original paper) 2019-09-26 220Machine learning and artificial neural network
  • 221. Non-linear feature dimension reduction: t-SNE  Note  ∂C(Z)/∂z_i = 4 Σ_j (p_ij − q_ij)(z_i − z_j)(1 + ‖z_i − z_j‖²)^(−1)  Let X be the original space and Z be the new space.  The direction of movement of z_i contributed by z_j is either toward z_j or the opposite.  The sign is determined by (p_ij − q_ij), i.e., the movement is toward z_j if p_ij > q_ij (similarity in Z < that in X, or distance in Z > that in X).  The actual movement is given by the sum over all j  makes q_ij and p_ij as close as possible.  (1 + ‖z_i − z_j‖²)^(−1) can be regarded as the rate of movement.  The rate of movement is large if z_i and z_j are close together and vice versa  tries to keep the focus on local structure. 2019-09-26 221Machine learning and artificial neural network
  • 222. Non-linear feature dimension reduction: t-SNE  Comparison: PCA versus t-SNE  400-dimensional features mapped to 2-dimensional features 2019-09-26 222Machine learning and artificial neural network
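A hedged scikit-learn sketch of such a comparison; the digits data set (64-dimensional) is used here as a stand-in for the 400-dimensional features on the slide.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                    # 64-dimensional features
Z_pca = PCA(n_components=2).fit_transform(X)           # linear mapping
Z_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)  # non-linear mapping

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(Z_pca[:, 0], Z_pca[:, 1], c=y, s=5)
axes[0].set_title("PCA")
axes[1].scatter(Z_tsne[:, 0], Z_tsne[:, 1], c=y, s=5)
axes[1].set_title("t-SNE")
plt.show()
```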
  • 223. Non-linear feature dimension reduction: t-SNE  Perplexity: setting σ_i  Perplexity is defined for a point x_i as Perp(P_i) = 2^H(P_i), where H(P_i) = −Σ_j p_{j|i} log₂ p_{j|i} with p_{j|i} = exp(−‖x_i − x_j‖²/2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖²/2σ_i²)  We make the perplexity roughly the same for all i, i.e.,  set σ_i smaller in dense regions (many points nearby)  set σ_i larger in sparse regions (few points nearby)  In this way, the effective number of points nearby is made roughly the same.  Binary search can be used to find σ_i.  Typical values of the perplexity are 5~50 2019-09-26 223Machine learning and artificial neural network
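A minimal NumPy sketch of the binary search for σ_i matching a target perplexity; it assumes dist2 holds the squared distances from x_i to the other points (excluding x_i itself), and the search bounds are arbitrary.

```python
import numpy as np

def sigma_for_perplexity(dist2, target=30.0, n_iter=50):
    """Binary-search sigma_i so that Perp(P_i) = 2**H(P_i) matches the target."""
    lo, hi = 1e-10, 1e10
    sigma = 0.5 * (lo + hi)
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dist2 / (2.0 * sigma ** 2))
        p /= p.sum()                                        # conditional p_{j|i}
        perp = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))     # 2**H(P_i)
        if perp > target:
            hi = sigma                                      # too uniform -> shrink sigma
        else:
            lo = sigma                                      # too peaked -> grow sigma
    return sigma

# Example: squared distances from one point to 99 others
d2 = np.random.default_rng(0).random(99)
print(sigma_for_perplexity(d2, target=30.0))
```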
  • 224.  Practice: ML_practice7_clustering.ipynb 2019-09-26 224Machine learning and artificial neural network Computer Lab.