• Name- Sayan Ghosh
• Roll- 14400120032
• Sub Code- PEC-CS701E
• Sub Name- Machine Learning
• Academic Session- 2023-24(Odd
Sem)
• Sem- 7th
• Dept- CSE
• College- NITMAS(144)
Support Vector Machines, Non-Linearity and Kernel Methods
Getting good generalization on big datasets
• If we have a big data set that needs a
complicated model, the full Bayesian framework
is very computationally expensive.
• Is there a frequentist method that is faster but
still generalizes well?
Preprocessing the input vectors
• Instead of trying to predict the answer directly from the
raw inputs we could start by extracting a layer of
“features”.
– Sensible if we already know that certain combinations
of input values would be useful (e.g. edges or corners
in an image).
• Instead of learning the features we could design them by
hand.
– The hand-coded features are equivalent to a layer of
non-linear neurons that do not need to be learned.
– If we use a very big set of features for a two-class
problem, the classes will almost certainly be linearly
separable.
• But surely the linear separator will give poor generalization.
Is preprocessing cheating?
• It's cheating if we use a carefully designed set of task-specific, hand-coded features and then claim that the learning algorithm solved the whole problem.
– The really hard bit is done by designing the features.
• It's not cheating if we learn the non-linear preprocessing.
– This makes learning much more difficult and much more interesting (e.g. backpropagation after pre-training).
• It's not cheating if we use a very big set of non-linear features that is task-independent.
– Support Vector Machines do this.
– They have a clever way to prevent overfitting (first half of
lecture)
– They have a very clever way to use a huge number of
features without requiring nearly as much computation as
seems to be necessary (second half of lecture).
A hierarchy of model classes
• Some model classes can be arranged in a
hierarchy of increasing complexity.
• How do we pick the best level in the hierarchy
for modeling a given dataset?
A way to choose a model class
• We want to get a low error rate on unseen data.
– This is called “structural risk minimization”
• It would be really helpful if we could get a guarantee of
the following form:
Test error rate <= train error rate + f(N, h, p)
Where N = size of training set,
h = measure of the model complexity,
p = the probability that this bound fails
We need p to allow for really unlucky test sets.
• Then we could choose the model complexity that
minimizes the bound on the test error rate.
A weird measure of model complexity
• Suppose that we pick n datapoints and assign labels of +
or – to them at random. If our model class (e.g. a neural
net with a certain number of hidden units) is powerful
enough to learn any association of labels with the data,
it's too powerful!
• Maybe we can characterize the power of a model class
by asking how many datapoints it can “shatter” i.e. learn
perfectly for all possible assignments of labels.
– This number of datapoints is called the Vapnik-
Chervonenkis dimension.
– The model does not need to shatter all sets of
datapoints of size h. One set is sufficient.
• For planes in 3-D, h=4 even though 4 co-planar points cannot
be shattered.
An example of VC dimension
• Suppose our model class is a hyperplane.
• In 2-D, we can find a plane (i.e. a line) to deal with any labeling of three points. A 2-D hyperplane shatters 3 points.
• But we cannot deal with some of the possible labelings of four points. A 2-D hyperplane (i.e. a line) does not shatter 4 points.
Some examples of VC dimension
• The VC dimension of a hyperplane in 2-D is 3.
– In k dimensions it is k+1.
• It's just a coincidence that the VC dimension of a
hyperplane is almost identical to the number of
parameters it takes to define a hyperplane.
• A sine wave has infinite VC dimension and only 2
parameters! By choosing the phase and period carefully
we can shatter any random collection of one-dimensional
datapoints (except for nasty special cases).
$f(x) = a \sin(bx)$
The probabilistic guarantee

$E_{test} \;\le\; E_{train} \;+\; \sqrt{\dfrac{h\left(\log(2N/h)+1\right) - \log(p/4)}{N}}$

where N = size of training set
h = VC dimension of the model class
p = upper bound on probability that this bound fails
So if we train models with different complexity, we should pick the one that minimizes this bound.
Actually, this is only sensible if we think the bound is fairly tight, which it usually isn't. The theory provides insight, but in practice we still need some witchcraft.
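As an illustration (not part of the original slides), the bound is easy to evaluate numerically. The sketch below assumes natural logarithms and the form given above; the function name is my own.

```python
import math

def vc_bound_gap(N, h, p):
    """Upper bound on (test error - train error) from the VC guarantee:
    sqrt((h * (log(2N/h) + 1) - log(p/4)) / N)."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(p / 4)) / N)

# The gap shrinks as N grows and grows with the VC dimension h.
for h in (10, 100, 1000):
    print(h, round(vc_bound_gap(N=10_000, h=h, p=0.05), 3))
```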
Preventing overfitting when using big sets of features
• Suppose we use a big set of features
to ensure that the two classes are
linearly separable. What is the best
separating line to use?
• The Bayesian answer is to use them
all (including ones that do not quite
separate the data.)
• Weight each line by its posterior
probability (i.e. by a combination of
how well it fits the data and how well it
fits the prior).
• Is there an efficient way to
approximate the correct Bayesian
answer?
Support Vector Machines
• The line that maximizes the minimum
margin is a good bet.
– The model class of “hyper-planes
with a margin of m” has a low VC
dimension if m is big.
• This maximum-margin separator is
determined by a subset of the
datapoints.
– Datapoints in this subset are
called “support vectors”.
– It will be useful computationally if
only a small fraction of the
datapoints are support vectors,
because we use the support
vectors to decide which side of the
separator a test case is on.
The support vectors
are indicated by the
circles around them.
Training a linear SVM
• To find the maximum margin separator, we have to solve the
following optimization problem:
• This is tricky but it’s a convex problem. There is only one
optimum and we can find it without fiddling with learning
rates or weight decay or early stopping.
– Don’t worry about the optimization problem. It has been
solved. It's called quadratic programming.
– It takes time proportional to N^2, which is really bad for very big datasets
• so for big datasets we end up doing approximate optimization!
$\mathbf{w} \cdot \mathbf{x}^c + b \ge +1$ for positive cases
$\mathbf{w} \cdot \mathbf{x}^c + b \le -1$ for negative cases
and $\|\mathbf{w}\|^2$ is as small as possible
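As a hedged illustration (not from the slides): an off-the-shelf quadratic-programming-based solver handles this. With scikit-learn, a linear-kernel SVC with a very large C approximates the hard-margin problem above; the toy data here is my own.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard-margin objective: minimize ||w||^2
# with w . x_c + b >= +1 for positive cases and <= -1 for negative cases.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("number of support vectors:", len(clf.support_vectors_))
```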
Testing a linear SVM
• The separator is defined as the set of points for
which:
$\mathbf{w} \cdot \mathbf{x} + b = 0$
so if $\mathbf{w} \cdot \mathbf{x}^c + b > 0$ say it's a positive case
and if $\mathbf{w} \cdot \mathbf{x}^c + b < 0$ say it's a negative case
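Spelled out as code (a trivial sketch; the weights and bias are made-up numbers standing in for a trained separator):

```python
import numpy as np

w = np.array([1.0, 1.0])   # hypothetical trained weights
b = -0.5                   # hypothetical trained bias

def classify(x, w, b):
    """Say positive if w . x + b > 0 and negative if w . x + b < 0."""
    score = float(np.dot(w, x) + b)
    return +1 if score > 0 else -1

print(classify(np.array([2.0, 1.0]), w, b))    # +1, since 2.5 > 0
print(classify(np.array([-1.0, -1.0]), w, b))  # -1, since -2.5 < 0
```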
A Bayesian Interpretation
• Using the maximum margin separator often
gives a pretty good approximation to using all
separators weighted by their posterior
probabilities.
What to do if there is no separating plane
• Use a much bigger set of features.
– This looks as if it would make the computation
hopelessly slow, but in the next part of the
lecture we will see how to use the “kernel”
trick to make the computation fast even with
huge numbers of features.
• Extend the definition of maximum margin to
allow non-separating planes.
– This can be done by using “slack” variables
Introducing slack variables
• Slack variables are constrained to be non-negative.
When they are greater than zero they allow us to cheat
by putting the plane closer to the datapoint than the
margin. So we need to minimize the amount of cheating.
This means we have to pick a value for lambda (this
sounds familiar!)
$\mathbf{w} \cdot \mathbf{x}^c + b \ge +1 - \xi_c$ for positive cases
$\mathbf{w} \cdot \mathbf{x}^c + b \le -1 + \xi_c$ for negative cases
with $\xi_c \ge 0$ for all $c$
and $\dfrac{\|\mathbf{w}\|^2}{2} + \lambda \sum_c \xi_c$ as small as possible
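In software the trade-off between margin size and total slack is a single user-chosen constant (the lambda above; scikit-learn calls it C, with small C meaning lots of slack allowed). A hedged sketch on overlapping toy data of my own:

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes: no separating plane exists without slack.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(+1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Allowing more slack (small C) tends to leave more points on or inside
    # the margin, which shows up as more support vectors.
    print(f"C = {C:>6}: {len(clf.support_vectors_)} support vectors")
```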
A picture of the best plane with a slack variable
The story so far
• If we use a large set of non-adaptive features, we can
often make the two classes linearly separable.
– But if we just fit any old separating plane, it will not
generalize well to new cases.
• If we fit the separating plane that maximizes the margin
(the minimum distance to any of the data points), we will
get much better generalization.
– Intuitively, by maximizing the margin we are
squeezing out all the surplus capacity that came from
using a high-dimensional feature space.
• This can be justified by a whole lot of clever mathematics
which shows that
– large margin separators have lower VC dimension.
– models with lower VC dimension have a smaller gap
between the training and test error rates.
Why do large margin separators have lower VC
dimension?
• Consider a random set of N points that
all fit inside a unit hypercube.
• If the number of dimensions is bigger
than N-2, it is easy to find a separating
plane for any labeling of the points.
– So the fact that there is a separating plane doesn't tell us much. It's like putting a straight line through 2 data points.
• But there is unlikely to be a separating plane with a margin that is big.
– If we find such a plane, it's unlikely to be a coincidence. So it will probably apply to the test data too.
How to make a plane curved
• Fitting hyperplanes as
separators is mathematically
easy.
– The mathematics is linear.
• By replacing the raw input
variables with a much larger
set of features we get a nice
property:
– A planar separator in the
high-dimensional space of
feature vectors is a curved
separator in the low
dimensional space of the
raw input variables.
A planar separator in
a 20-D feature space
projected back to the
original 2-D space
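A minimal sketch of this idea (my own illustration, not from the slides): expand 2-D inputs into polynomial features, fit a linear separator in the expanded space, and the boundary seen back in the original 2-D space is curved.

```python
from sklearn.datasets import make_circles
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# A circular class boundary: not linearly separable in the raw 2-D inputs.
X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# Degree-3 polynomial features play the role of the hand-designed feature layer.
phi = PolynomialFeatures(degree=3, include_bias=False)
clf = LinearSVC(C=10.0, max_iter=10000).fit(phi.fit_transform(X), y)

# A planar separator in the 9-D feature space; a curved one in the 2-D space.
print("training accuracy:", clf.score(phi.transform(X), y))
```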
A potential problem and a magic solution
• If we map the input vectors into a very high-dimensional
feature space, surely the task of finding the maximum-
margin separator becomes computationally intractable?
– The mathematics is all linear, which is good, but the
vectors have a huge number of components.
– So taking the scalar product of two vectors is very
expensive.
• The way to keep things tractable is to use
“the kernel trick”
• The kernel trick makes your brain hurt when you first learn about it, but it's actually very simple.
What the kernel trick achieves
• All of the computations that we need to do to find the
maximum-margin separator can be expressed in terms of
scalar products between pairs of datapoints (in the high-
dimensional feature space).
• These scalar products are the only part of the computation
that depends on the dimensionality of the high-dimensional
space.
– So if we had a fast way to do the scalar products we
would not have to pay a price for solving the learning
problem in the high-D space.
• The kernel trick is just a magic way of doing scalar products
a whole lot faster than is usually possible.
– It relies on choosing a way of mapping to the high-
dimensional feature space that allows fast scalar
products.
The kernel trick
• For many mappings from
a low-D space to a high-D
space, there is a simple
operation on two vectors
in the low-D space that
can be used to compute
the scalar product of their
two images in the high-D
space.
$K(\mathbf{x}_a, \mathbf{x}_b) = \phi(\mathbf{x}_a) \cdot \phi(\mathbf{x}_b)$
[Diagram: the low-D vectors $\mathbf{x}_a$ and $\mathbf{x}_b$ are mapped to their high-D images $\phi(\mathbf{x}_a)$ and $\phi(\mathbf{x}_b)$; doing the scalar product in the obvious way works in the high-D space, while letting the kernel do the work stays in the low-D space.]
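To make this concrete (an illustrative sketch with a feature map chosen for the example, not anything from the slides): for 2-D inputs and the kernel K(x, y) = (x . y + 1)^2 there is an explicit 6-dimensional map phi such that phi(x) . phi(y) gives exactly the same number, so the high-D scalar product never has to be formed.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose scalar products equal (x . y + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

def K(x, y):
    """The kernel: a cheap operation on the two low-D vectors."""
    return (np.dot(x, y) + 1.0) ** 2

a = np.array([0.3, -1.2])
b = np.array([2.0, 0.5])

print(np.dot(phi(a), phi(b)))  # scalar product of the high-D images
print(K(a, b))                 # the kernel, computed entirely in low-D
```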
Dealing with the test data
• If we choose a mapping to a high-D space for
which the kernel trick works, we do not have to
pay a computational price for the high-
dimensionality when we find the best hyper-plane.
– We cannot express the hyperplane by using its normal
vector in the high-dimensional space because this
vector would have a huge number of components.
– Luckily, we can express it in terms of the support
vectors.
• But what about the test data? We cannot compute the scalar product $\mathbf{w} \cdot \phi(\mathbf{x})$ because $\mathbf{w}$ lives in the high-D space.
Dealing with the test data
• We need to decide which side of the separating
hyperplane a test point lies on and this requires
us to compute a scalar product.
• We can express this scalar product as a
weighted average of scalar products with the
stored support vectors
– This could still be slow if there are a lot of support vectors.
The classification rule
• The final classification rule is quite simple:
$\text{bias} + \sum_{s \,\in\, SV} w_s \, K(\mathbf{x}^{test}, \mathbf{x}^s) \;>\; 0$
where SV is the set of support vectors.
• All the cleverness goes into selecting the support vectors that maximize the margin and computing the weight to use on each support vector.
• We also need to choose a good kernel function and we may need to choose a lambda for dealing with non-separable cases.
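As a hedged illustration of this rule with a library SVM (the attribute names are scikit-learn's; its dual_coef_ already folds the class labels into the weights on the support vectors):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

x_test = np.array([[0.1, 0.2]])

# bias + sum over support vectors s of w_s * K(x_test, x_s)
k = rbf_kernel(x_test, clf.support_vectors_, gamma=1.0)       # K(x_test, x_s)
score = clf.intercept_[0] + np.dot(k, clf.dual_coef_[0])[0]   # w_s = dual_coef_

print(score, clf.decision_function(x_test)[0])  # the two numbers agree
```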
Some commonly used kernels
Polynomial: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p$
Gaussian radial basis function: $K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2 / 2\sigma^2}$
Neural net: $K(\mathbf{x}, \mathbf{y}) = \tanh(k \, \mathbf{x} \cdot \mathbf{y} - \delta)$
The parameters $p$, $\sigma$, $k$ and $\delta$ must be chosen by the user.
For the neural network kernel, there is one “hidden unit” per support vector, so the process of fitting the maximum margin hyperplane decides how many hidden units to use. Also, it may violate Mercer’s condition.
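Written out as plain functions (an illustrative sketch; p, sigma, k and delta are the user-chosen parameters from the slide, with arbitrary defaults):

```python
import numpy as np

def polynomial_kernel(x, y, p=3):
    return (np.dot(x, y) + 1.0) ** p

def gaussian_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.dot(x - y, x - y) / (2.0 * sigma ** 2))

def neural_net_kernel(x, y, k=1.0, delta=0.0):
    # tanh(k x.y - delta); may violate Mercer's condition for some settings.
    return np.tanh(k * np.dot(x, y) - delta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), gaussian_rbf_kernel(x, y), neural_net_kernel(x, y))
```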
Performance
• Support Vector Machines work very well in practice.
– The user must choose the kernel function and its
parameters, but the rest is automatic.
– The test performance is very good.
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• SVM’s are very good if you have no idea about what
structure to impose on the task.
• The kernel trick can also be used to do PCA in a much
higher-dimensional space, thus giving a non-linear version
of PCA in the original space.
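The kernel-PCA remark can be sketched with scikit-learn (a hedged example; the kernel choice and its parameter are arbitrary here):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# PCA in the implicit high-dimensional RBF feature space gives a
# non-linear projection of the original 2-D data.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5.0)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (200, 2)
```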
Support Vector Machines are Perceptrons!
• SVM’s use each training case, x, to define a feature K(x,
.) where K is chosen by the user.
– So the user designs the features.
• Then they do “feature selection” by picking the support
vectors, and they learn how to weight the features by
solving a big optimization problem.
• So an SVM is just a very clever way to train a standard
perceptron.
– All of the things that a perceptron cannot do cannot
be done by SVM’s (but it’s a long time since 1969 so
people have forgotten this).
A problem that cannot be solved using a
kernel that computes the similarity of a test
image to a training case
• Suppose we have images that may contain a tank, but
with a cluttered background.
• To recognize which ones contain a tank, it is no good
computing a global similarity
– A non-tank test image may have a very similar
background to a tank training image, so it will have
very high similarity if the tanks are only a small
fraction of the image.
• We need local features that are appropriate for the task.
So they must be learned, not pre-specified.
• It's very appealing to convert a learning problem to a
convex optimization problem
– but we may end up by ignoring aspects of the real
learning problem in order to make it convex.
A hybrid approach
• If we use a neural net to define the features, maybe we
can use convex optimization for the final layer of weights
and then backpropagate derivatives to “learn the kernel”.
• The convex optimization is quadratic in the number of
training cases. So this approach works best when most
of the data is unlabelled.
– Unsupervised pre-training can then use the
unlabelled data to learn features that are appropriate
for the domain.
– The final convex optimization can use these features
as well as possible and also provide derivatives that
allow them to be fine-tuned.
– This seems better than just trying lots of kernels and
selecting the best ones (which is the current method).
Learning to extract the orientation of a face
patch (Ruslan Salakhutdinov)
The training and test sets
11,000 unlabeled cases
100, 500, or 1000 labeled cases
The test cases are face patches from new people.
The root mean squared error in the orientation when combining GP's with deep belief nets:

                 GP on the    GP on top-level   GP on top-level features
                 pixels       features          with fine-tuning
  100 labels     22.2         17.9              15.2
  500 labels     17.2         12.7              7.2
  1000 labels    16.3         11.2              6.4

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.
Non-linearity in Machine Learning
• Introduction: Non-linearity is a fundamental
concept in machine learning that refers to the
relationship between input and output variables
in a model. Linear models, such as linear
regression, assume a linear relationship
between the input and output variables, while
non-linear models allow for a more complex
relationship between the variables. In this article,
we will discuss the concept of non-linearity in
machine learning, its applications, and some
examples of non-linear models.
• What is Non-Linearity? Non-linearity refers to
the relationship between input and output
variables in a model. In a linear model, the
relationship between the variables is
represented by a straight line, while in a non-
linear model, the relationship is represented by a
more complex function. The simplest example of
a non-linear function is the quadratic function,
which has a parabolic shape. Non-linearity can
also refer to the relationship between multiple
input variables and the output variable. For
example, in a linear model, the relationship
between two input variables and the output
variable is represented by a plane, while in a
non-linear model, the relationship is represented by a curved surface.
Applications of Non-Linearity:
• Non-linear models are useful in a wide range of
applications, including image and speech
recognition, natural language processing, and
time series forecasting. For example, non-linear
models can be used to classify images based on
their content, such as identifying objects in an
image. Non-linear models can also be used to
recognize speech, by modeling the relationship
between the audio signal and the spoken words.
• In natural language processing, non-linear
models can be used to understand the meaning
of text, such as identifying the subject, predicate,
and object in a sentence. In time series
forecasting, non-linear models can be used to
predict future values based on historical data,
such as predicting stock prices.
Examples of Non-Linear Models:
• There are many different types of non-linear
models, each with its own strengths and
weaknesses. Some examples of non-linear
models include:
• Neural networks: Neural networks are a type of
non-linear model that are inspired by the
structure of the human brain. They consist of
layers of interconnected nodes, or artificial
neurons, that process and transmit information.
Neural networks can be used for a wide range of
tasks, including image and speech recognition,
natural language processing, and time series
forecasting.
• Support Vector Machines (SVMs): Support
Vector Machines are a type of non-linear model
that are used for classification and regression
tasks. SVMs find the hyperplane that separates
different classes of data in a high-dimensional
space. They can also be used to classify data
that is not linearly separable by using a non-
linear kernel function to transform the data into a
higher-dimensional space, as sketched below.
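A small hedged sketch of that point (the dataset and kernel settings are illustrative choices of mine): on data with a circular class boundary, a linear SVM is near chance while an RBF-kernel SVM separates the classes almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))  # roughly chance level
print("RBF kernel accuracy:   ", rbf.score(X, y))     # close to 1.0
```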
• Random Forest: Random Forest is a type of
non-linear model that is used for classification
and regression tasks. It is an ensemble method
that combines multiple decision trees to make
predictions. The decision trees are generated by
randomly selecting subsets of the input variables
and building a tree based on those variables.
• Gradient Boosting: Gradient Boosting is another
type of non-linear model that is used for
classification and regression tasks. It is an
ensemble method that combines multiple
decision trees to make predictions. The decision
trees are generated by building multiple trees in
sequence, where each tree tries to correct the
mistakes of the previous tree.
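For illustration (a hedged sketch with scikit-learn; the dataset and settings are arbitrary choices of mine), both ensembles can be fitted in a few lines:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random Forest: many trees grown on random subsets, averaged together.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# Gradient Boosting: trees grown in sequence, each correcting the last.
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("random forest test accuracy:    ", rf.score(X_te, y_te))
print("gradient boosting test accuracy:", gb.score(X_te, y_te))
```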
Conclusion
• In conclusion, non-linearity in machine learning
is a powerful tool for capturing complex
relationships between inputs and outputs that
cannot be represented by simple linear
functions. It is used in a wide range of machine
learning tasks, including classification,
regression, and clustering. Non-linear models offer greater flexibility than linear models, usually at some cost in interpretability, and are widely used in neural networks, decision trees, and support vector machines.