INTRODUCTION TO MACHINE LEARNING AND DEEP LEARNING
Terry Taewoong Um (terry.t.um@gmail.com)
University of Waterloo, Department of Electrical & Computer Engineering
T-robotics.blogspot.com
Facebook.com/TRobotics
CAUTION
• I cannot explain everything
• You cannot get every detail
• Try to get the big picture
• Get some useful keywords
• Connect them with your research
CONTENTS
1. What is Machine Learning?
2. What is Deep Learning?
CONTENTS
1. What is Machine Learning?
WHAT IS MACHINE LEARNING?
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E“ – T. Michell (1997)
Example: A program for soccer tactics
T : Win the game
P : Goals
E : (x) Players’ movements
(y) Evaluation
WHAT IS MACHINE LEARNING?
“Toward learning robot table tennis”, J. Peters et al. (2012)
https://youtu.be/SH3bADiB7uQ
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E“ – T. Michell (1997)
Terry Taewoong Um (terry.t.um@gmail.com)
TASKS
• Classification : discrete target values (e.g. x : pixels (28×28), y : 0, 1, 2, 3, …, 9)
• Regression : real target values (e.g. x ∈ (0, 100), y : real values)
• Clustering : no target values (e.g. x ∈ (-3, 3) × (-3, 3))
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E“ – T. Michell (1997)
Terry Taewoong Um (terry.t.um@gmail.com)
PERFORMANCE
• Classification : 0-1 loss function
• Regression : L2 loss function
• Clustering : no label-based loss (no target values)
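• A minimal NumPy sketch of these two loss functions (my own illustration, not from the slides; the label and target arrays are made-up toy values):

```python
import numpy as np

# Toy predictions vs. targets (made-up values, just for illustration)
y_true_cls = np.array([3, 7, 2, 2, 9])        # true digit labels
y_pred_cls = np.array([3, 1, 2, 2, 4])        # predicted digit labels

y_true_reg = np.array([1.2, 0.5, 2.3, 1.9])   # true real-valued targets
y_pred_reg = np.array([1.0, 0.7, 2.0, 2.4])   # predicted real-valued targets

# 0-1 loss: count 1 for every misclassified sample, averaged over the dataset
zero_one_loss = np.mean(y_true_cls != y_pred_cls)

# L2 loss: squared error, averaged over the dataset (mean squared error)
l2_loss = np.mean((y_true_reg - y_pred_reg) ** 2)

print(zero_one_loss)   # 0.4 (2 of the 5 samples are misclassified)
print(l2_loss)         # mean of [0.04, 0.04, 0.09, 0.25] = 0.105
```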
EXPERIENCE
• Classification : labeled data, (pixels) → (number)
• Regression : labeled data, (x) → (y)
• Clustering : unlabeled data, (x1, x2)
A TOY EXAMPLE
[Figure] Height (cm) vs. Weight (kg) data: Input X = height, Output Y = weight; predict the weight (?) for a new height
[Figure] Height (cm) vs. Weight (kg) data fitted with a line Y = aX + b (e.g. weight ≈ 80 kg at height 180 cm)
Model : Y = aX + b   Parameters : (a, b)
[Goal] Find (a, b) which best fits the given data
[Analytic Solution]
• Least squares problem: from AX = b, X = A#b, where A# is the pseudoinverse of A
• Not always available
[Numerical Solution]
1. Set a cost function
2. Apply an optimization method (e.g. the Gradient Descent (GD) method)
[Figure] The cost surface L over the parameters (a, b)
(http://www.yaldex.com/game-development/1592730043_ch18lev1sec4.html)
[Figure] Local minima problem
(http://mnemstudio.org/neural-networks-multilayer-perceptron-design.htm)
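• A minimal sketch of both routes for the toy model Y = aX + b (my own illustration; the height/weight values, the standardization step, and the learning rate are assumptions): the analytic least-squares solution via the pseudoinverse, and a plain gradient-descent loop on the L2 cost.

```python
import numpy as np

# Made-up height (cm) / weight (kg) samples for the toy example
X = np.array([150., 160., 170., 180., 190.])
Y = np.array([50., 58., 66., 77., 85.])

# Analytic solution: stack [X, 1] so that A @ (a, b) = Y, then use the pseudoinverse
A = np.column_stack([X, np.ones_like(X)])
a_ls, b_ls = np.linalg.pinv(A) @ Y

# Numerical solution: gradient descent on the cost L(a, b) = mean((aX + b - Y)^2).
# Standardize X first; GD converges much faster on well-scaled inputs.
Xs = (X - X.mean()) / X.std()
a, b, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    err = a * Xs + b - Y
    a -= lr * 2 * np.mean(err * Xs)
    b -= lr * 2 * np.mean(err)

# Map the fitted line back to the original height scale
a_gd = a / X.std()
b_gd = b - a * X.mean() / X.std()

print(a_ls, b_ls)   # analytic least-squares fit
print(a_gd, b_gd)   # gradient-descent fit, close to the analytic one
```

For this convex quadratic cost there is a single minimum; the local-minima issue appears once the model or the cost becomes non-convex, e.g. in neural networks.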
[Figure] Age (year) vs. running record (min) data (e.g. around 140 min at age 32)
WHAT WOULD BE THE CORRECT MODEL?
Select a model → Set a cost function → Optimization
[Figure] X vs. Y data with an unknown value (?) to predict
WHAT WOULD BE THE CORRECT MODEL?
1. Regularization 2. Nonparametric model
“overfitting”
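• A small sketch of what "overfitting" looks like (my own illustration with made-up data): a high-degree polynomial drives the training error down but does worse on fresh data than a simpler model.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

x_tr, y_tr = make_data(12)       # small training set
x_new, y_new = make_data(200)    # fresh data from the same source

def poly_fit_errors(deg):
    w = np.polyfit(x_tr, y_tr, deg)                    # least-squares polynomial fit
    tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)    # training error
    te = np.mean((np.polyval(w, x_new) - y_new) ** 2)  # error on unseen data
    return tr, te

print(poly_fit_errors(3))   # moderate training error, similar error on new data
print(poly_fit_errors(9))   # tiny training error, (much) larger error on new data
```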
L2 REGULARIZATION
• Add a penalty on the weight magnitudes to the cost: minimize (data loss) + λ‖w‖²  (e.g. w = (a, b) where Y = aX + b)
• Avoid a complicated model!
• Another interpretation : Maximum a Posteriori (MAP) estimation
(http://goo.gl/6GE2ix)
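• A minimal ridge-regression sketch of this idea (my own illustration, not the slide's exact equation; the toy data, polynomial degree, and λ value are assumptions): adding the penalty λ‖w‖² to the least-squares cost shrinks the weights of an over-flexible model, which matches the MAP view with a Gaussian prior on w.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up noisy samples from a smooth underlying function
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Degree-9 polynomial features: a deliberately over-flexible model
Phi = np.vander(x, 10, increasing=True)     # columns 1, x, x^2, ..., x^9

def ridge_fit(Phi, y, lam):
    # Minimizer of ||Phi w - y||^2 + lam * ||w||^2:
    #   w = (Phi^T Phi + lam I)^(-1) Phi^T y
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

w_plain = ridge_fit(Phi, y, lam=0.0)    # ordinary least squares: large weights, overfits
w_ridge = ridge_fit(Phi, y, lam=1e-3)   # L2-regularized: smaller weights, smoother fit

print(np.abs(w_plain).max())   # typically very large
print(np.abs(w_ridge).max())   # much smaller
```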
WHAT WOULD BE THE CORRECT MODEL?
1. Regularization 2. Nonparametric model
[Figure] The training error keeps decreasing with training time while the test error starts to rise again; we should stop where the validation error starts to increase
• Training set : for training (parameter optimization)
• Validation set : for early stopping (avoid overfitting), keep watching the validation error
• Test set : for evaluation (measure the performance)
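• A minimal sketch of this split-and-early-stop recipe (my own illustration; the data, the linear model, the learning rate, and the patience value are assumptions): optimize on the training set, keep watching the validation error, keep the best parameters, and touch the test set only for the final evaluation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up dataset, split into training / validation / test sets
X = rng.uniform(-3, 3, size=(300, 5))
w_true = rng.standard_normal(5)
y = X @ w_true + 0.5 * rng.standard_normal(300)

X_tr, y_tr = X[:200], y[:200]          # for training (parameter optimization)
X_va, y_va = X[200:250], y[200:250]    # for early stopping (avoid overfitting)
X_te, y_te = X[250:], y[250:]          # for evaluation only

w = np.zeros(5)
lr = 0.01
best_w, best_va = w.copy(), np.inf
patience, bad_epochs = 10, 0

for epoch in range(1000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of the training L2 loss
    w -= lr * grad
    va_err = np.mean((X_va @ w - y_va) ** 2)             # keep watching the validation error
    if va_err < best_va:
        best_va, best_w, bad_epochs = va_err, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # no improvement for `patience` epochs: stop here
            break

test_err = np.mean((X_te @ best_w - y_te) ** 2)          # reported once, at the very end
print(best_va, test_err)
```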
NONPARAMETRIC MODEL
• It does not assume a fixed parametric model (e.g. Y = aX + b, Y = aX² + bX + c, etc.)
• It often requires many more samples
• Kernel methods are frequently applied for modeling the data
• Gaussian Process Regression (GPR), a kernel method, is a widely-used nonparametric regression method
• Support Vector Machine (SVM), also a kernel method, is a widely-used nonparametric classification method
[Figure] A kernel function maps the data from the input space to a (higher-dimensional) feature space
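• A minimal sketch of a kernel function (my own illustration; the RBF form and the bandwidth gamma are assumed choices): an RBF (Gaussian) kernel returns similarities that correspond to inner products in a high-dimensional feature space, without ever constructing φ(X) explicitly.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2), evaluated for all pairs of rows."""
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [3.0, 4.0]])

K = rbf_kernel(X, X)
print(K.round(3))
# Diagonal entries are 1 (each point is identical to itself);
# nearby points (rows 0 and 1) get a large value, distant points a value near 0.
```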
SUPPORT VECTOR MACHINE (SVM)
“Myo”, Thalmic Labs (2013)
https://youtu.be/oWu9TFJjHaM
[Figures] Linear classifiers vs. the maximum-margin classifier; in the dual formulation, the inner products between samples can be replaced by a kernel function
(Support Vector Machine Tutorial, J. Weston, http://goo.gl/19ywcj)
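• A usage sketch (assuming scikit-learn is available; the XOR-style toy data and the hyperparameters C and gamma are my own choices): a kernel SVM separates data that no linear classifier can.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: not linearly separable in the input space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# RBF-kernel SVM: the kernel implicitly maps the data into a feature space
# where a maximum-margin separating hyperplane exists
clf = SVC(kernel="rbf", C=10.0, gamma=1.0)
clf.fit(X, y)

print(clf.predict(X))         # expected: [0, 1, 1, 0]
print(clf.support_vectors_)   # the training points that define the margin
```

With kernel="linear" the same four points cannot all be classified correctly, which is exactly what the kernel trick buys us.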
GAUSSIAN PROCESS REGRESSION (GPR)
https://youtu.be/YqhLnCm0KXY
https://youtu.be/kvPmArtVoFE
• Gaussian distribution
• Multivariate regression likelihood
• Bayes' rule : posterior ∝ likelihood × prior
• Prediction : condition the joint Gaussian distribution of the observed & predicted values
(https://goo.gl/EO54WN, http://goo.gl/XvOOmf)
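• A minimal NumPy sketch of this conditioning step (my own illustration; the squared-exponential kernel, the noise level, and the toy data are assumptions): the predictive mean and covariance come from conditioning the joint Gaussian of the observed and predicted values.

```python
import numpy as np

def sqexp(A, B, ell=1.0):
    # Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 ell^2)) for 1-D inputs
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

X = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])   # observed inputs (made-up)
y = np.sin(X)                               # observed outputs
Xs = np.linspace(-5, 5, 9)                  # inputs where we want predictions
noise = 1e-2

# Condition the joint Gaussian over [y, f*] on the observed y:
#   mean* = K(X*, X) [K(X, X) + noise I]^(-1) y
#   cov*  = K(X*, X*) - K(X*, X) [K(X, X) + noise I]^(-1) K(X, X*)
K = sqexp(X, X) + noise * np.eye(len(X))
Ks = sqexp(Xs, X)
Kss = sqexp(Xs, Xs)

mean = Ks @ np.linalg.solve(K, y)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))

print(mean.round(2))   # posterior (predictive) mean at the test inputs
print(std.round(2))    # posterior uncertainty: small near the observed points
```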
DIMENSION REDUCTION
• Kernel methods map the data from the low-dim. original space to a high-dim. feature space: X → φ(X)
• Dimension reduction goes the other way: from a high-dim. space to a low-dim. space
• Principal Component Analysis (PCA) : find the orthogonal axes (= principal components) which maximize the variance of the data
  Y = PX, where the rows of P are the m largest eigenvectors of (1/N)XXᵀ (the covariance matrix)
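• A minimal PCA sketch matching the formula above (my own illustration; the toy data is made up): center the data, eigen-decompose the covariance matrix, and project onto the m leading eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 3-dimensional samples stored as the columns of X (d x N), as in Y = PX
N = 200
z = rng.standard_normal(N)
X = np.vstack([z,
               2 * z + 0.1 * rng.standard_normal(N),
               0.1 * rng.standard_normal(N)])      # most variance lies along one direction
X = X - X.mean(axis=1, keepdims=True)              # center each dimension

# Covariance matrix (1/N) X X^T and its eigen-decomposition
C = (X @ X.T) / N
eigvals, eigvecs = np.linalg.eigh(C)               # eigenvalues in ascending order

m = 2
P = eigvecs[:, ::-1][:, :m].T                      # rows of P = the m largest eigenvectors
Y = P @ X                                          # projected (m x N) data

print(eigvals[::-1].round(3))   # variance captured by each principal component
print(Y.shape)                  # (2, 200)
```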
DIMENSION REDUCTION
http://jbhuang0604.blogspot.kr/2013/04/miss-korea-2013-contestants-face.html
SUMMARY - PART 1
• Machine Learning
- Tasks : classification, regression, clustering, etc.
- Performance : 0-1 loss, L2 loss, etc.
- Experience : labeled data, unlabeled data
• Machine Learning Process
(1) Select a parametric / nonparametric model
(2) Set a performance measure, including a regularization term
(3) Train on the training data (optimize the parameters) until the validation error increases
(4) Evaluate the final performance on the test set
• Nonparametric models : Support Vector Machine, Gaussian Process Regression
• Dimension reduction : used for pre-processing the data