CSCE-421 Machine Learning
13. Kernel Machines
Instructor: Guni Sharon, classes: TR 3:55-5:10, HRBB 124
Based on a lecture by Kilian Weinberger and Joaquin Vanschoren
Announcements
• Midterm on Tuesday, November 23 (in class)
• Due:
  • Quiz 4: ML debugging and kernelization, due Nov 4
  • Assignment (P3): SVM, linear regression, and kernelization, due Tuesday, Nov 16
Feature Maps
• Linear models: $y = w^\top x = \sum_i w_i x_i = w_1 x_1 + \cdots + w_p x_p$
• When we cannot fit the data well (non-linear patterns), add non-linear combinations of features
• Feature map (or basis expansion) $\phi : X \to \mathbb{R}^d$:
  $y = w^\top x \;\to\; y = w^\top \phi(x)$
• E.g., polynomial feature map: all polynomials up to degree $d$:
  $\phi: [1, x_1, \ldots, x_p] \to [1, x_1, \ldots, x_p, x_1^2, \ldots, x_p^2, \ldots, x_p^d, x_1 x_2, \ldots, x_i x_j]$
• Example with $p = 1$, $d = 3$:
  $y = w_1 x_1 \;\to\; y = w_1 x_1 + w_2 x_1^2 + w_3 x_1^3$
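To make the $p = 1$, $d = 3$ example concrete, here is a minimal NumPy sketch (my illustration, with made-up data, not from the slides): the cubic fit is still an ordinary linear model in the expanded features.

```python
import numpy as np

# Hypothetical 1-D data generated from a cubic pattern plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.5 * x - 0.8 * x**2 + 0.3 * x**3 + rng.normal(scale=0.1, size=50)

# Feature map phi(x) = [1, x, x^2, x^3]: basis expansion up to degree 3.
Phi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

# Ordinary least squares on the expanded features -- linear in w, cubic in x.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)  # roughly [0, 1.5, -0.8, 0.3]
```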
Realization
• Computing the transformed dot product $\phi(x_i)^\top \phi(x_j)$ for all observation pairs $i, j$ is efficient with a kernel function
• Even when mapping to an infinite-dimensional feature space (RBF)
• Two requirements:
  1. Predict based on $\phi(x_i)^\top \phi(x_j)$. Don't rely on $w^\top \phi(x_i)$
  2. Train ($\alpha$) based on $\phi(x_i)^\top \phi(x_j)$. Don't train $w$
• $w = \sum_i \alpha_i \phi(x_i)$
• $w^\top \phi(z) = \left( \sum_i \alpha_i \phi(x_i) \right)^\top \phi(z) = \sum_i \alpha_i \phi(x_i)^\top \phi(z)$
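A quick sanity check of the efficiency claim (my sketch, not from the slides): for the degree-2 polynomial kernel, $k(x, z) = (x^\top z)^2$ equals the inner product under the explicit feature map $\phi(x) = (x_i x_j)_{i,j}$, but costs $O(p)$ work instead of materializing $p^2$ features.

```python
import numpy as np

rng = np.random.default_rng(1)
x, z = rng.normal(size=3), rng.normal(size=3)

def phi(v):
    # Explicit degree-2 feature map: all pairwise products v_i * v_j (p^2 features).
    return np.outer(v, v).ravel()

lhs = phi(x) @ phi(z)   # inner product computed in the expanded feature space
rhs = (x @ z) ** 2      # the kernel computes the same value without expanding
print(np.isclose(lhs, rhs))  # True
```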
Using kernels in ML
• In order to use kernels in ML algorithms, we need to show that we can train and predict using only inner products of the observations
• Then we can simply swap the inner product with the kernel function
• For example, kernelizing (Euclidean) 1-nearest-neighbor is straightforward
  • Training: none
  • Predicting: $h(x) = y(x_t)$, where $x_t = \arg\min_{x_t \in D} \|x - x_t\|_2 = \arg\min_{x_t \in D} \|x - x_t\|_2^2$
  • $\|x - x_t\|_2^2 = \underbrace{x^\top x}_{K(x,x)} - 2 \underbrace{x^\top x_t}_{K(x,x_t)} + \underbrace{x_t^\top x_t}_{K(x_t,x_t)}$
Using kernels in ML
• Ordinary Least Squares:
  • $\arg\min_w 0.5\,\|xw - y\|^2$
  • Squared loss
  • No regularization
  • Closed form: $w = (x^\top x)^{-1} x^\top y$
  • Closed form: $\alpha = \,?$
• Ridge Regression:
  • $\arg\min_w 0.5\,\|xw - y\|^2 + \lambda \|w\|^2$
  • Squared loss
  • $\ell_2$-regularization
  • Closed form: $w = (x^\top x + \lambda I)^{-1} x^\top y$
  • Closed form: $\alpha = \,?$
From ๐‘ค to inner product
โ€ข Claim: the weight vector is always some linear combination of the
training feature vectors: ๐‘ค = ๐‘– ๐›ผ๐‘–๐‘ฅ๐‘– = ๐‘ฅโŠค๐›ผ
โ€ข Was proven last week
6
Kernelizing Ordinary Least Squares
• $\min_w \; \ell = 0.5\,\|xw - y\|^2$
• $\nabla_w \ell = x^\top (xw - y) = \mathbf{0}_d$
• $w = (x^\top x)^{-1} x^\top y$
• Substitute $w = x^\top \alpha$: $\; x^\top \alpha = (x^\top x)^{-1} x^\top y$
• $(x x^\top)^{-1} x \, x^\top \alpha = (x x^\top)^{-1} x \, (x^\top x)^{-1} x^\top y$
• On the left, $(x x^\top)^{-1} x x^\top = I$
• On the right, $x (x^\top x)^{-1} x^\top$ acts as $I$, because $x^\top x (x^\top x)^{-1} x^\top = x^\top I$
• $\alpha = (x x^\top)^{-1} y = K^{-1} y$
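A minimal NumPy sketch of this closed form (my illustration; here $n < d$, so $K = x x^\top$ is invertible, which the derivation implicitly assumes):

```python
import numpy as np

def linear_kernel(A, B):
    # Matrix of pairwise inner products; any valid kernel can be swapped in.
    return A @ B.T

# Hypothetical data: n = 3 observations in d = 5 dimensions.
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 5))
y = rng.normal(size=3)

K = linear_kernel(X, X)         # the n x n kernel matrix k_{i,j} = x_i^T x_j
alpha = np.linalg.solve(K, y)   # alpha = K^{-1} y, the kernelized OLS solution

# Sanity check: the model reproduces the training targets exactly.
print(np.allclose(K @ alpha, y))  # True
```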
You can't do that…
• You can't define $k_{i,j} = x_i^\top x_j$ and then say that $K = x x^\top$
• Obviously $x^\top x \neq x x^\top$
• Actually, this is correct: $x_i$ is a vector, while $x$ is a matrix. Let's break it down:
• $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}$
• $x x^\top = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} [x_1, \ldots, x_n] = \begin{bmatrix} k_{1,1} & \cdots & k_{1,n} \\ \vdots & \ddots & \vdots \\ k_{n,1} & \cdots & k_{n,n} \end{bmatrix}$
OK, so what is $x^\top x$?
• $x^\top x = [x_1, \ldots, x_n] \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} F_{1,1} & \cdots & F_{1,d} \\ \vdots & \ddots & \vdots \\ F_{d,1} & \cdots & F_{d,d} \end{bmatrix}$, where $F_{i,j} = \sum_t x_{t,i} x_{t,j}$
• Sanity check: Ordinary Least Squares
  • $w = (x^\top x)^{-1} x^\top y \in \mathbb{R}^d$
  • $\alpha = (x x^\top)^{-1} y = K^{-1} y \in \mathbb{R}^n$
What about predictions?
• We can train a kernelized linear (no longer linear) regression model:
  $\alpha = (x x^\top)^{-1} y = K^{-1} y \in \mathbb{R}^n$
• Can we use the trained $\alpha$ for prediction? This is our end game!
• Originally we had $h(x_i) = w^\top x_i$, but we didn't train $w$
• $w = \sum_i \alpha_i x_i$
• $h(z) = \left( \sum_i \alpha_i x_i \right)^\top z = \sum_i \alpha_i x_i^\top z$
• This is a linear model with $n$ dimensions
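Completing the earlier sketch, prediction on a new point needs only kernel evaluations against the training set; no weight vector is ever formed (again my illustration, with hypothetical test points):

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 5))     # training inputs
y = rng.normal(size=3)          # training targets
alpha = np.linalg.solve(linear_kernel(X, X), y)

def predict(Z):
    # h(z) = sum_i alpha_i * k(x_i, z), evaluated for each row z of Z.
    return linear_kernel(Z, X) @ alpha

Z = rng.normal(size=(2, 5))     # two hypothetical test points
print(predict(Z))
```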
Kernelized Support Vector Machines
• $\min_{w,b} \; w^\top w + C \sum_i \xi_i$
• s.t.
  • $\forall i: \; y_i (w^\top x_i + b) \ge 1 - \xi_i$
  • $\xi_i \ge 0$
• $C$ is a hyperparameter (bias vs. variance)
• Goal: reformulate the optimization problem with inner products and no $w$
• Step 1: define the dual optimization problem
Duality principle in optimization
• Optimization problems may be viewed from either of two perspectives: the primal problem or the dual problem
• The solution to the dual problem provides a lower bound on the solution of the primal (minimization) problem
• For convex optimization problems, the duality gap is zero under a constraint qualification condition

[Figure: duality gap between the primal and dual optima]
The dual problem
• Form the Lagrangian of the minimization problem using nonnegative Lagrange multipliers
• Solve for the primal variable values that optimize the Lagrangian (minimizing the original objective function)
• The dual problem expresses the primal variables as functions of the Lagrange multipliers, which are called dual variables; the new problem is to maximize the objective function with respect to the dual variables, subject to the derived constraints
Kernelized Support Vector Machines
• Primal:
  • $\min_{w,b} \; w^\top w + C \sum_i \xi_i$
  • s.t.
    • $\forall i: \; y_i (w^\top x_i + b) \ge 1 - \xi_i$
    • $\xi_i \ge 0$
• Dual:
  • $\min_{\alpha_1, \ldots, \alpha_n} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k_{i,j} - \sum_{i=1}^{n} \alpha_i$
  • s.t.
    • $0 \le \alpha_i \le C$
    • $\sum_{i=1}^{n} \alpha_i y_i = 0$
• We won't derive the dual problem (it requires substantial background)
• Bottom line: the dual objective is defined as a function of the alphas, the labels, and inner products only (no weights)
• In this case, we can show that $w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)$, where $y \in \{-1, +1\}$
• Problem: $b$ is not part of the dual optimization
Kernelized Support Vector Machines
• For the primal formulation we know (from a previous lecture) that only support vectors satisfy the constraint with equality: $y_i (w^\top \phi(x_i) + b) = 1$
• In the dual, these same training inputs are identified by dual values satisfying $\alpha_i > 0$ (all other training inputs have $\alpha_i = 0$)
• At test time you only need to compute the sum in $h(x)$ over the support vectors; all inputs $x_i$ with $\alpha_i = 0$ can be discarded after training
• This fact allows us to compute $b$ in closed form
Kernelized Support Vector Machines
• Primal: support vectors have $y_i (w^\top \phi(x_i) + b) = 1$
• Dual: support vectors have $\alpha_i > 0$
• The primal solution and the dual solution are identical
• As a result, for all $i$ with $\alpha_i > 0$: $\; y_i \left( \sum_j y_j \alpha_j k_{j,i} + b \right) = 1$
• $b = \frac{1}{y_i} - \sum_j y_j \alpha_j k_{j,i} = y_i - \sum_j y_j \alpha_j k_{j,i}$, since $y \in \{-1, +1\} \Rightarrow \frac{1}{y_i} = y_i$
Kernel SVM = weighted K-NN
• K-NN with $y \in \{-1, +1\}$:
  • $h(z) = \operatorname{sign}\left( \sum_{i=1}^{n} y_i \, \delta_{nn}(x_i, z) \right)$
  • $\delta_{nn}(x_i, z) = \begin{cases} 1, & x_i \in K\text{-nearest neighbors of } z \\ 0, & \text{else} \end{cases}$
• Kernel SVM:
  • $h(z) = \operatorname{sign}\left( \sum_{i=1}^{n} y_i \alpha_i k(x_i, z) + b \right)$
• Instead of counting the K nearest neighbors equally, kernel SVM considers all neighbors, scaled by a similarity measure (the kernel) and a unique learned scale per data point ($\alpha_i$)
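The parallel is easy to see in code; a minimal sketch of both decision rules (my illustration; the SVM's alpha and b would come from training, as above):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma)

def h_knn(X, y, z, K=3):
    # Unweighted vote of the K nearest neighbors.
    nn = np.argsort([np.sum((xi - z) ** 2) for xi in X])[:K]
    return np.sign(y[nn].sum())

def h_svm(X, y, alpha, b, z, k=rbf):
    # Every training point votes, weighted by y_i * alpha_i * k(x_i, z).
    return np.sign(sum(yi * ai * k(xi, z) for xi, yi, ai in zip(X, y, alpha)) + b)
```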
[Figure-only slide: kernel SVM decision boundaries with the RBF kernel $k(x, z) = \exp\left( -\frac{\|x - z\|^2}{\sigma} \right)$]
[Figure-only slide: SVM with soft constraints ($C$ hyperparameter)]
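For reference, a short scikit-learn sketch of fitting a soft-margin kernel SVM (assuming scikit-learn is installed; the data is synthetic and the hyperparameter values are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data with a circular (non-linear) decision boundary.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)

# Smaller C -> softer margin (more bias); larger C -> harder margin (more variance).
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X))
```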
Kernelized SVM
• Pros
  • SVM classification can be very efficient, because prediction uses only a subset of the training data: the support vectors
  • Works very well on smaller data sets, on non-linear data sets, and in high-dimensional spaces
  • Very effective when the number of dimensions exceeds the number of samples
  • Can achieve high accuracy, sometimes outperforming neural networks
  • Not very sensitive to overfitting
• Cons
  • Training time is high on large data sets
  • When the data set is noisy (i.e., target classes overlap), SVM doesn't perform well
What did we learn?
• Kernel functions allow us to utilize powerful linear models to predict non-linear patterns
• This requires representing the linear model through $x_i^\top x_j$ and $\alpha$, removing the weight vector $w$
• Once we have an appropriate representation, we simply swap $x_i^\top x_j$ with $k_{i,j}$
What next?
• Class: Decision trees
• Assignments:
  • Assignment (P3): SVM, linear regression, and kernelization, due Tuesday, Nov 16
• Quizzes:
  • Quiz 4: ML debugging and kernelization, due Nov 4