1. CSCE-421 Machine Learning
13. Kernel Machines
Instructor: Guni Sharon, classes: TR 3:55-5:10, HRBB 124
Based on a lecture by Kilian Weinberger and Joaquin Vanschoren
2. Announcements
• Midterm on Tuesday, November 23 (in class)
• Due:
• Quiz 4: ML debugging and kernelization, due Nov 4
• Assignment (P3): SVM, linear regression and kernelization, due Tuesday, Nov 16
3. Feature Maps
• Linear models: $y = w^\top x = \sum_i w_i x_i = w_1 x_1 + \dots + w_p x_p$
• When we cannot fit the data well (non-linear), add non-linear combinations of features
• Feature map (or basis expansion) $\phi : \mathbb{R}^p \to \mathbb{R}^d$:
  $y = w^\top x \;\to\; y = w^\top \phi(x)$
• E.g., polynomial feature map: all polynomials up to degree $d$
  $\phi : [1, x_1, \dots, x_p] \to [1, x_1, \dots, x_p, x_1^2, \dots, x_p^2, \dots, x_p^d, x_1 x_2, \dots, x_{p-1} x_p]$
• Example with $p = 1$, $d = 3$:
  $y = w_1 x_1 \;\to\; y = w_1 x_1 + w_2 x_1^2 + w_3 x_1^3$
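A minimal numpy sketch of the $p = 1$, $d = 3$ example above; the function name and weight values are illustrative, not from the lecture:

```python
import numpy as np

def poly_feature_map(x, d=3):
    # basis expansion for a single scalar feature: [x, x^2, ..., x^d]
    return np.array([x ** k for k in range(1, d + 1)])

# y = w1*x1 + w2*x1^2 + w3*x1^3 is still linear in the expanded features
w = np.array([0.5, -1.0, 2.0])   # illustrative weights
x = 1.5
y = w @ poly_feature_map(x)      # w^T phi(x)
```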
4. Realization
• Computing the transformed dot product $\phi(x_i)^\top \phi(x_j)$ for all observation pairs $i, j$ is efficient with a kernel function
• Even when mapping to an infinite feature space (RBF)
• Two requirements:
1. Predict based on $\phi(x_i)^\top \phi(x_j)$. Don't rely on $w^\top \phi(x_i)$
2. Train ($\alpha$) based on $\phi(x_i)^\top \phi(x_j)$. Don't train $w$
• $w = \sum_i \alpha_i \phi(x_i)$
• $w^\top \phi(z) = \left( \sum_i \alpha_i \phi(x_i) \right)^\top \phi(z) = \sum_i \alpha_i \phi(x_i)^\top \phi(z)$
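To make the "efficient dot product" claim concrete, here is a small check (my own example, not from the slides) that the degree-2 polynomial kernel $k(x, z) = (x^\top z)^2$ equals the dot product under an explicit feature map:

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for 2-D inputs
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    # polynomial kernel: same value as phi(x)^T phi(z), never forms phi
    return (x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.isclose(phi(x) @ phi(z), k(x, z)))  # True
```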
5. Using kernels in ML
• In order to use kernels in ML algorithms, we need to show that we can train and predict using inner products of the observations
• Then, we can simply swap the inner product with the kernel function
• For example, kernelizing (Euclidean) 1-nearest-neighbors is straightforward
• Training: none
• Predicting: $h(x) = y(x_t)$, where $x_t = \arg\min_{x_t \in D} \|x - x_t\|_2 = \arg\min_{x_t \in D} \|x - x_t\|_2^2$
• $\|x - x_t\|_2^2 = x^\top x - 2 x^\top x_t + x_t^\top x_t$, where the three terms are $K(x, x)$, $K(x, x_t)$, and $K(x_t, x_t)$
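A sketch of the kernelized 1-NN predictor described above, instantiated here with an RBF kernel; the function names are my own:

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    # RBF kernel; any valid kernel k(x, z) could be swapped in
    return np.exp(-gamma * np.sum((x - z) ** 2))

def nn1_predict(X_train, y_train, x, k=rbf):
    # squared distance via kernels: K(x,x) - 2K(x,x_t) + K(x_t,x_t)
    d2 = [k(x, x) - 2 * k(x, xt) + k(xt, xt) for xt in X_train]
    return y_train[int(np.argmin(d2))]
```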
6. Using kernels in ML
• Ordinary Least Squares:
• $\arg\min_w 0.5 \|xw - y\|^2$
• Squared loss
• No regularization
• Closed form: $w = (x^\top x)^{-1} x^\top y$
• Closed form: $\alpha = ?$
• Ridge Regression:
• $\arg\min_w 0.5 \|xw - y\|^2 + \lambda \|w\|^2$
• Squared loss
• $\ell_2$-regularization
• Closed form: $w = (x^\top x + \lambda I)^{-1} x^\top y$
• Closed form: $\alpha = ?$
7. From $w$ to inner product
• Claim: the weight vector is always some linear combination of the training feature vectors: $w = \sum_i \alpha_i x_i = x^\top \alpha$
• Was proven last week
9. You can't do that…
• You can't define $K_{i,j} = x_i^\top x_j$ and then say that $K = x x^\top$
• Obviously $x^\top x \ne x x^\top$
• Actually, this is correct: $x_i$ is a vector and $x$ a matrix. Let's break it down
• $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}$
• $x x^\top = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \begin{bmatrix} x_1, \dots, x_n \end{bmatrix} = \begin{bmatrix} K_{1,1} & \cdots & K_{1,n} \\ \vdots & \ddots & \vdots \\ K_{n,1} & \cdots & K_{n,n} \end{bmatrix}$
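A quick shape check (my own illustration) of the two products:

```python
import numpy as np

x = np.random.default_rng(1).normal(size=(4, 2))  # n=4 rows, d=2 features
print((x @ x.T).shape)  # (4, 4): K, one inner product per pair of observations
print((x.T @ x).shape)  # (2, 2): the d-by-d matrix discussed on the next slide
```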
10. OK, so what is $x^\top x$?
• $x^\top x = \begin{bmatrix} x_1, \dots, x_n \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} S_{1,1} & \cdots & S_{1,d} \\ \vdots & \ddots & \vdots \\ S_{d,1} & \cdots & S_{d,d} \end{bmatrix}$
• Where $S_{i,j} = \sum_t x_{t,i} x_{t,j}$
• Sanity check: Ordinary Least Squares
• $w = (x^\top x)^{-1} x^\top y \in \mathbb{R}^d$
• $\alpha = (x x^\top)^{-1} y = K^{-1} y \in \mathbb{R}^n$
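The sanity check can be verified numerically. With plain OLS and $n > d$, $x x^\top$ is singular, so the sketch below uses the ridge versions of both closed forms (slide 6), for which the identity $(x^\top x + \lambda I)^{-1} x^\top y = x^\top (x x^\top + \lambda I)^{-1} y$ holds:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # n=6 observations, d=3 features
y = rng.normal(size=6)
lam = 0.1                              # ridge term keeps both inverses defined

w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # primal, in R^d
alpha = np.linalg.solve(X @ X.T + lam * np.eye(6), y)     # dual,   in R^n

z = rng.normal(size=3)
print(np.isclose(w @ z, alpha @ (X @ z)))  # True: identical predictions
```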
11. What about predictions?
• We can train a kernelized linear (not linear anymore) regression model
• $\alpha = (x x^\top)^{-1} y = K^{-1} y \in \mathbb{R}^n$
• Can we use the trained $\alpha$ for prediction? This is our end game!
• Originally, we had $h(x_i) = w^\top x_i$
• But we didn't train $w$
• $w = \sum_i \alpha_i x_i$
• $h(z) = \left( \sum_i \alpha_i x_i \right)^\top z = \sum_i \alpha_i x_i^\top z$
• This is a linear model with $n$ dimensions
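Putting training and prediction together for kernel regression, a minimal sketch (the names are mine; `lam=0` recovers the unregularized form on the slide, provided $K$ is invertible):

```python
import numpy as np

def fit(K, y, lam=1e-3):
    # alpha = (K + lam*I)^{-1} y
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(alpha, k_z):
    # h(z) = sum_i alpha_i K(x_i, z); k_z holds K(x_i, z) for every training x_i
    return alpha @ k_z
```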
12. Kernelized Support Vector Machines
• $\min_{w,\xi} \; w^\top w + C \sum_i \xi_i$
• S.T.
• $\forall i \; y_i(w^\top x_i + b) \ge 1 - \xi_i$
• $\xi_i \ge 0$
• $C$ is a hyperparameter (bias vs. variance)
• Goal: reformulate the optimization problem with inner products and no $w$
• Step 1: define the dual optimization problem
13. Duality principle in optimization
• Optimization problems may be viewed from either of two perspectives: the primal problem or the dual problem
• The solution to the dual problem provides a lower bound on the solution of the primal (minimization) problem
• For convex optimization problems, the duality gap is zero under a constraint qualification condition
[Figure: duality gap between the primal and dual solutions]
14. The dual problem
• Form the Lagrangian of the minimization problem using nonnegative Lagrange multipliers
• Solve for the primal variable values that optimize the Lagrangian; this expresses the primal variables as functions of the Lagrange multipliers, which are called dual variables
• The new problem is then to maximize the objective function with respect to the dual variables, subject to the derived constraints
15. Kernelized Support Vector Machines
• Primal:
• $\min_{w,\xi} \; w^\top w + C \sum_i \xi_i$
• S.T.
• $\forall i \; y_i(w^\top x_i + b) \ge 1 - \xi_i$
• $\xi_i \ge 0$
• Dual:
• $\min_{\alpha_1,\dots,\alpha_n} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K_{i,j} - \sum_{i=1}^n \alpha_i$
• S.T.
• $0 \le \alpha_i \le C$
• $\sum_{i=1}^n \alpha_i y_i = 0$
We won't derive the dual problem (requires substantial background)
Bottom line: the objective function is defined as a function of alphas, labels, and inner products (no weights)
In this case, we can show that $w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$, where $y \in \{-1, +1\}$
Problem: $b$ is not part of the dual optimization
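One way to experiment with the dual form without deriving it is scikit-learn's `SVC` with a precomputed Gram matrix; this is my own example, not part of the lecture:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 0.], [3., 3.], [4., 3.]])
y = np.array([-1, -1, 1, 1])
K = X @ X.T                                  # linear-kernel Gram matrix

clf = SVC(C=1.0, kernel="precomputed").fit(K, y)
print(clf.support_)     # indices i with alpha_i > 0 (the support vectors)
print(clf.dual_coef_)   # the products y_i * alpha_i for those vectors
print(clf.intercept_)   # b, recovered by the solver
```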
16. Kernelized Support Vector Machines
• For the primal formulation we know (from a previous lecture) that only support vectors satisfy the constraint with equality: $y_i(w^\top \phi(x_i) + b) = 1$
• In the dual, these same training inputs can be identified because their corresponding dual values satisfy $\alpha_i > 0$ (all other training inputs have $\alpha_i = 0$)
• At test time you only need to compute the sum in $h(z)$ over the support vectors; all inputs $x_i$ with $\alpha_i = 0$ can be discarded after training
• This fact allows us to compute $b$ in closed form
17. Kernelized Support Vector Machines
• Primal: support vectors have $y_i(w^\top \phi(x_i) + b) = 1$
• Dual: support vectors have $\alpha_i > 0$
• The primal solution and the dual solution are identical
• As a result, $\forall i : \alpha_i > 0$, $\; y_i \left( \sum_j y_j \alpha_j K_{j,i} + b \right) = 1$
• $b = \frac{1}{y_i} - \sum_j y_j \alpha_j K_{j,i} = y_i - \sum_j y_j \alpha_j K_{j,i}$
• Since $y \in \{-1, +1\}$, we have $\frac{1}{y_i} = y_i$
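A sketch of the closed form for $b$; averaging over all support vectors instead of picking a single $i$ is a common numerical stabilization (my addition, not from the slide):

```python
import numpy as np

def svm_bias(alpha, y, K, tol=1e-8):
    # b = y_i - sum_j y_j alpha_j K_{j,i}, for any i with alpha_i > 0;
    # averaged here over all such i for numerical stability
    sv = alpha > tol
    b_values = y[sv] - K[sv] @ (alpha * y)
    return b_values.mean()
```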
18. Kernel SVM = weighted K-NN
• K-NN with $y \in \{-1, +1\}$:
• $h(z) = \operatorname{sign}\left( \sum_{i=1}^n y_i \, \delta^{nn}(x_i, z) \right)$
• $\delta^{nn}(x_i, z) = \begin{cases} 1, & x_i \in K\text{-nearest neighbors of } z \\ 0, & \text{else} \end{cases}$
• Kernel SVM:
• $h(z) = \operatorname{sign}\left( \sum_{i=1}^n y_i \alpha_i K(x_i, z) + b \right)$
• Instead of considering the K nearest neighbors equally, kernel SVM considers all neighbors, scaled by a distance measure (the kernel) and a unique learned scale per data point (alpha)
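The two decision rules side by side, as a sketch (the helper names are hypothetical; `k` is any kernel function):

```python
import numpy as np

def knn_decision(X, y, z, K=3):
    # plain K-NN: the K closest points vote with equal weight
    idx = np.argsort(((X - z) ** 2).sum(axis=1))[:K]
    return np.sign(y[idx].sum())

def ksvm_decision(X, y, alpha, b, z, k):
    # kernel SVM: every point votes, weighted by alpha_i and k(x_i, z)
    return np.sign(sum(a * yi * k(xi, z) for a, yi, xi in zip(alpha, y, X)) + b)
```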
21. Kernelized SVM
• Pros
• SVM classification can be very efficient because it uses only a subset of the training data: the support vectors
• Works very well on smaller data sets, on non-linear data sets, and in high-dimensional spaces
• Very effective in cases where the number of dimensions is greater than the number of samples
• Can achieve high accuracy, sometimes performing even better than neural networks
• Not very sensitive to overfitting
• Cons
• Training time is high for large data sets
• When the data set is noisy (i.e., target classes overlap), SVM doesn't perform well
22. What did we learn?
• Kernel functions allow us to utilize powerful linear models to predict non-linear patterns
• This requires us to represent the linear model through $x_i^\top x_j$ and $\alpha$ while removing the weight vector $w$
• Once we have an appropriate representation, we simply swap $x_i^\top x_j$ with $K_{i,j}$
23. What next?
• Class: Decision trees
• Assignments:
• Assignment (P3): SVM, linear regression and kernelization, due Tuesday, Nov 16
• Quizzes:
• Quiz 4: ML debugging and kernelization, due Nov 4