The document discusses machine learning techniques including regression, Bayesian learning, and support vector machines. It provides details on linear regression, logistic regression, Bayes' theorem, concept learning, the Bayes optimal classifier, naive Bayes classifier, and Bayesian belief networks. The document is a slide presentation given by Dr. Radhey Shyam on machine learning techniques, outlining these various topics in greater detail over multiple slides.
Regression, Bayesian Learning and Support vector machine
1. Machine Learning Techniques
Regression
Bayesian Learning
Support Vector Machine
Dr. Radhey Shyam
Professor
Dept. of Computer Science & Engg., BIET Lucknow
Following slides have been prepared by Radhey Shyam, with grateful acknowledgement of others who made their course
contents freely available. Feel free to reuse these slides for your own academic purposes. Please send feedback at
shyam0058@gmail.com
Dr. Radhey Shyam (BIET Lucknow-AKTU) Machine Learning Techniques September 29, 2020 1 / 64
2. Outline
1 Regression
Simple Linear Regression
Logistic Regression
2 Bayesian Learning
Bayes Theorem
Concept Learning
Bayes Optimal Classifier
NAIVE BAYES CLASSIFIER
BAYESIAN BELIEF NETWORKS
EM Algorithm
3 Support Vector Machine
4 Conclusion
5 References
3. Regression
Linear Regression: In statistics, linear regression is a linear approach to
modeling the relationship between a scalar response (or dependent
variable) and one or more explanatory variables (or independent variables).
The case of one explanatory variable is called simple linear regression.
Linear regression is used to predict a continuous dependent variable
from a given set of independent variables.
Linear regression is used for solving regression problems.
In linear regression, the values of continuous variables are predicted.
Linear regression tries to find the best-fit line, through which the
output can be easily predicted.
The least squares estimation method¹ is used to estimate the model parameters; the accuracy² of the fit is then judged by how close the predictions are to the actual values.
The output of linear regression must be a continuous value, such as
price, age, etc.
4. Regression (cont.)
In linear regression, the relationship between the dependent variable
and the independent variables must be linear.
In linear regression, there may be collinearity³ between the
independent variables, although strong collinearity makes the estimated coefficients unstable.
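As a minimal sketch of the least squares idea described above, the following Python snippet fits a simple linear regression line in closed form. The function name `fit_simple_linear_regression` and the data points are illustrative assumptions, not part of the original slides.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form least squares solution for one explanatory variable.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)

# Predict a continuous output for a new input.
prediction = slope * 5.0 + intercept
```

Because the sample points lie exactly on a line, the fitted slope and intercept recover that line; with noisy data the same formulas give the line with the smallest squared error.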
Some regression examples:
Regression analysis is used in statistics to find trends in data. For
example, you might guess that there is a connection between how
much you eat and how much you weigh; regression analysis can help
you quantify that.
Regression analysis will provide you with an equation for a graph so
that you can make predictions about your data. For example, if
you’ve been putting on weight over the last few years, it can predict
how much you’ll weigh in ten years' time if you continue to put on
weight at the same rate.
5. Regression (cont.)
Regression with a single explanatory variable is called simple linear regression. It establishes the relationship
between two variables using a straight line. If two or more
explanatory variables have a linear relationship with the dependent
variable, the regression is called multiple linear regression.
6. Regression (cont.)
Logistic Regression: used to solve classification problems in which a given
element has to be assigned to one of N categories. Typical
examples are classifying a mail as spam or not spam, or
finding the category to which a vehicle belongs (car, truck, van, etc.).
In other words, the output is one of a finite set of discrete values.
Logistic regression is used to predict a categorical dependent
variable from a given set of independent variables.
Logistic regression is used for solving classification problems.
In logistic regression, we predict the values of categorical variables.
In logistic regression, we fit an S-shaped (sigmoid) curve by which we can classify
the samples.
The maximum likelihood estimation method is used to estimate the model
parameters.
The output of logistic regression must be a categorical value such
as 0 or 1, Yes or No, etc.
7. Regression (cont.)
In logistic regression, a linear relationship between the dependent and
independent variables is not required.
In logistic regression, there should be no collinearity between the
independent variables.
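The S-curve and the maximum likelihood fitting mentioned for logistic regression can be sketched as follows. This is a minimal illustration for a single input variable that uses plain gradient ascent on the log-likelihood; the function names, learning rate, epoch count and data are assumptions made for the example, not taken from the slides.

```python
import math


def sigmoid(z):
    """The S-shaped curve that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(xs, ys, learning_rate=0.1, epochs=5000):
    """Maximize the average log-likelihood by gradient ascent; returns (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log-likelihood with respect to w and b.
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w += learning_rate * grad_w
        b += learning_rate * grad_b
    return w, b


def predict(x, w, b, threshold=0.5):
    """Return the categorical output 0 or 1."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Hypothetical data: hours studied versus pass (1) or fail (0).
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic_regression(xs, ys)
```

The classes overlap slightly on purpose: with perfectly separable data the maximum likelihood weights grow without bound, so a practical implementation would add regularization or a stopping rule.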
¹ The least squares method is a statistical procedure that finds the best fit for a set of
data points by minimizing the sum of the squared offsets (residuals) of the points from the fitted curve. Least
squares regression is used to predict the behavior of dependent variables.
² Accuracy is how close a measured value is to the actual value. Precision is how
close the measured values are to each other.
³ Collinearity is a condition in which some of the independent variables are highly
correlated.
8. Bayesian Learning
Bayesian Decision Theory came long before Version Spaces, Decision
Tree Learning and Neural Networks. It was studied in the field of
Statistical Theory and more specifically, in the field of Pattern
Recognition.
Bayesian Decision Theory is at the basis of important learning
schemes such as the Naïve Bayes Classifier, Learning Bayesian Belief
Networks and the EM Algorithm.
Bayesian Decision Theory is also useful as it provides a framework
within which many non-Bayesian classifiers can be studied.
Bayes Theorem
Goal — To determine the most probable hypothesis, given the data
D plus any initial knowledge about the prior probabilities of the
various hypotheses in H.
9. Bayesian Learning (cont.)
Prior probability of h, P(h) — it reflects any background
knowledge we have about the chance that h is a correct hypothesis
(before having observed the data).
Prior probability of D, P(D) — it reflects the probability that
training data D will be observed given no knowledge about which
hypothesis h holds.
Conditional Probability of observation D, P(D|h) — it denotes
the probability of observing data D given some world in which
hypothesis h holds.
Posterior probability of h, P(h|D) — it represents the probability
that h holds given the observed training data D. It reflects our
confidence that h holds after we have seen the training data D and it
is the quantity that Machine Learning researchers are interested in.
10. Bayesian Learning (cont.)
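Bayes Theorem allows us to compute P(h|D):
\[ P(h|D) = \frac{P(D|h)\,P(h)}{P(D)} \]
Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood
Goal: To find the most probable hypothesis h from a set of
candidate hypotheses H given the observed data D. The MAP
hypothesis is
\[
\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(h|D) \\
        &= \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} \\
        &= \arg\max_{h \in H} P(D|h)\,P(h)
\end{aligned}
\]
The last step drops P(D) because it is a constant that does not depend on h.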
11. Bayesian Learning (cont.)
If every hypothesis in H is equally probable a priori, we only need to
consider the likelihood of the data D given h, P(D|h). Then, hMAP
becomes the Maximum Likelihood hypothesis,
\[ h_{ML} = \arg\max_{h \in H} P(D|h) \]
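To make the difference between the MAP and the maximum likelihood hypothesis concrete, here is a small sketch over three hypotheses. The priors and likelihoods are made-up numbers chosen only for illustration; the helper name `posterior` is likewise an assumption.

```python
def posterior(priors, likelihoods):
    """Bayes theorem: P(h|D) = P(D|h) P(h) / P(D), with P(D) = sum_h P(D|h) P(h)."""
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}


# Hypothetical values, not from the slides.
priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}        # P(h)
likelihoods = {"h1": 0.1, "h2": 0.5, "h3": 0.6}   # P(D|h)

post = posterior(priors, likelihoods)

# MAP: maximize P(D|h) P(h); the constant P(D) can be ignored.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
# ML: maximize P(D|h) only, as if all priors were equal.
h_ml = max(priors, key=lambda h: likelihoods[h])
```

With these numbers the two criteria disagree: h2 has the largest product P(D|h)P(h), while h3 has the largest likelihood alone. They coincide whenever the prior is uniform.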
Concept Learning
Inducing general functions from specific training examples is a main
issue of machine learning.
Concept Learning: Acquiring the definition of a general category from
given sample positive and negative training examples of the category.
12. Bayesian Learning (cont.)
Concept Learning can be seen as a problem of searching through a
predefined space of potential hypotheses for the hypothesis that best
fits the training examples.
The hypothesis space has a general-to-specific ordering of hypotheses,
and the search can be efficiently organized by taking advantage of a
naturally occurring structure over the hypothesis space.
A Formal Definition for Concept Learning:
Inferring a boolean-valued function from training examples of
its input and output.
An example for concept-learning is the learning of bird-concept from
the given examples of birds (positive examples) and non-birds
(negative examples).
We are trying to learn the definition of a concept from given examples.
13. Bayesian Learning (cont.)
The training data is a set of example days, each described by six attributes.
14. Bayesian Learning (cont.)
The task is to learn to predict the value of EnjoySport for an arbitrary
day, based on the values of its attributes.
EnjoySport – Hypothesis Representation
Each hypothesis consists of a conjunction of constraints on the
instance attributes.
Each hypothesis will be a vector of six constraints, specifying the
values of the six attributes
– (Sky, AirTemp, Humidity, Wind, Water, and Forecast).
Each constraint will be one of:
? – indicating any value is acceptable for the attribute (don’t care)
single value – specifying a single required value, e.g. Warm (specific)
φ – indicating no value is acceptable for the attribute (no value)
15. Bayesian Learning (cont.)
Hypothesis Representation
A hypothesis:
Sky AirTemp Humidity Wind Water Forecast
< Sunny, ?, ?, Strong, ?, Same >
The most general hypothesis – that every day is a positive example
<?, ?, ?, ?, ?, ? >
The most specific hypothesis – that no day is a positive example
< φ, φ, φ, φ, φ, φ >
EnjoySport concept learning task requires learning the sets of days
for which EnjoySport = yes, describing this set by a conjunction of
constraints over the instance attributes.
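The hypothesis representation above can be sketched in a few lines of Python. The tuple encoding, the constant names `ANY` and `NONE`, the function `matches` and the example day are assumptions made for illustration.

```python
ANY = "?"    # any value is acceptable for the attribute
NONE = "φ"   # no value is acceptable for the attribute


def matches(hypothesis, instance):
    """Return True if the instance satisfies every constraint of the hypothesis."""
    return all(
        constraint == ANY or (constraint != NONE and constraint == value)
        for constraint, value in zip(hypothesis, instance)
    )


# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast.
h = ("Sunny", ANY, ANY, "Strong", ANY, "Same")
most_general = (ANY,) * 6
most_specific = (NONE,) * 6

# A hypothetical day used only to exercise the function.
sunny_day = ("Sunny", "Warm", "High", "Strong", "Warm", "Same")
rainy_day = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")
```

A hypothesis classifies a day as positive (EnjoySport = yes) exactly when `matches` returns True, so the most general hypothesis accepts every day and the most specific one accepts none.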
16. Bayesian Learning (cont.)
EnjoySport Concept Learning Task
Given
Instances X : set of all possible days, each described by the attributes
Sky – (values: Sunny, Cloudy, Rainy)
AirTemp – (values: Warm, Cold)
Humidity – (values: Normal, High)
Wind – (values: Strong, Weak)
Water – (values: Warm, Cold)
Forecast – (values: Same, Change)
TargetConcept(Function) c : EnjoySport : X → {0, 1}
Hypotheses H : Each hypothesis is described by a conjunction of
constraints on the attributes.
TrainingExamples D: positive and negative examples of the target
function
Determine
A hypothesis h in H such that h(x) = c(x) for all x in D.
17. Bayesian Learning (cont.)
The Inductive Learning Hypothesis
Although the learning task is to determine a hypothesis h identical to
the target concept c over the entire set of instances X, the only
information available about c is its value over the training examples.
Inductive learning algorithms can at best guarantee that the
output hypothesis fits the target concept over the training data.
Lacking any further information, our assumption is that the best
hypothesis regarding unseen instances is the hypothesis that best fits
the observed training data. This is the fundamental assumption of
inductive learning.
The Inductive Learning Hypothesis - Any hypothesis found to
approximate the target function well over a sufficiently large set of
training examples will also approximate the target function well over
other unobserved examples.
18. Bayesian Learning (cont.)
Concept Learning as Search
Concept learning can be viewed as the task of searching through a
large space of hypotheses implicitly defined by the hypothesis
representation.
The goal of this search is to find the hypothesis that best fits the
training examples.
By selecting a hypothesis representation, the designer of the learning
algorithm implicitly defines the space of all hypotheses that the
program can ever represent and therefore can ever learn.
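To give a feel for the size of the space being searched, the sketch below counts instances and hypotheses for the EnjoySport attributes listed earlier. The dictionary layout and variable names are assumptions; the counting rules follow directly from the hypothesis representation (each constraint is a specific value, "?" or φ).

```python
from math import prod

# Attribute values as given in the EnjoySport task definition.
attribute_values = {
    "Sky": ["Sunny", "Cloudy", "Rainy"],
    "AirTemp": ["Warm", "Cold"],
    "Humidity": ["Normal", "High"],
    "Wind": ["Strong", "Weak"],
    "Water": ["Warm", "Cold"],
    "Forecast": ["Same", "Change"],
}

# Number of distinct days in the instance space X.
num_instances = prod(len(v) for v in attribute_values.values())

# Each constraint may be a specific value, "?" or φ.
syntactically_distinct = prod(len(v) + 2 for v in attribute_values.values())

# Any hypothesis containing φ accepts no instance, so all of those collapse
# into one; the rest use a specific value or "?" for each attribute.
semantically_distinct = 1 + prod(len(v) + 1 for v in attribute_values.values())
```

For these attributes the counts are 96 instances, 5120 syntactically distinct hypotheses and 973 semantically distinct ones, which is small compared with the 2^96 possible Boolean functions over X; the choice of representation is what restricts the search.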
19. Bayesian Learning (cont.)
Bayes Optimal Classifier
We have considered the question so far “what is the most probable
hypothesis given the training data?”
In fact, the question that is often of most significance is the closely
related question “what is the most probable classification of the new
instance given the training data?”
Although it may seem that this second question can be answered by
simply applying the MAP hypothesis to the new instance, in fact it is
possible to do better.
To develop some intuitions consider a hypothesis space containing
three hypotheses, h1, h2, and h3.
Suppose that the posterior probabilities of these hypotheses given the
training data are .4, .3, and .3 respectively.
20. Bayesian Learning (cont.)
Thus, h1 is the MAP hypothesis.
Suppose a new instance x is encountered, which is classified positive
by h1 , but negative by h2 and h3.
Taking all hypotheses into account, the probability that x is positive
is .4 (the probability associated with h1), and the probability that it is
negative is therefore .6. The most probable classification (negative) in
this case is different from the classification generated by the MAP
hypothesis.
In general, the most probable classification of the new instance is
obtained by combining the predictions of all hypotheses, weighted by
their posterior probabilities.
If the possible classification of the new example can take on any value
vj from some set V, then the probability P(vj|D) that the correct
classification for the new instance is vj is just
\[ P(v_j|D) = \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D) \]
21. Bayesian Learning (cont.)
The optimal classification of the new instance is the value vj for
which P(vj|D) is maximum:
\[ \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D) \tag{1} \]
Any system that classifies new instances according to Equation (1) is
called a Bayes optimal classifier, or Bayes optimal learner. No
other classification method using the same hypothesis space and same
prior knowledge can outperform this method on average.
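The three-hypothesis example from the previous slides can be checked with a short sketch. The posteriors .4, .3 and .3 and the predictions of h1, h2 and h3 come from the text; the function name `bayes_optimal` and the "+" / "-" labels are assumptions for the example.

```python
# Posterior probabilities P(h_i|D) from the example.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# Each hypothesis classifies x deterministically, so P(v|h) is 1 or 0.
predictions = {"h1": "+", "h2": "-", "h3": "-"}


def bayes_optimal(posteriors, predictions, values=("+", "-")):
    """Weight every hypothesis's vote by its posterior and pick the best value."""
    scores = {
        v: sum(p for h, p in posteriors.items() if predictions[h] == v)
        for v in values
    }
    return max(scores, key=scores.get), scores


# Classification produced by the single MAP hypothesis.
h_map = max(posteriors, key=posteriors.get)
map_label = predictions[h_map]

# Classification produced by combining all hypotheses.
optimal_label, scores = bayes_optimal(posteriors, predictions)
```

The MAP hypothesis h1 says positive, but the combined vote gives 0.4 for positive and 0.6 for negative, so the Bayes optimal classification is negative.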
22. Bayesian Learning (cont.)
NAIVE BAYES CLASSIFIER
One highly practical Bayesian learning method is the naive Bayes
learner, often called the naive Bayes classifier.
In some domains its performance has been shown to be comparable
to that of neural network and decision tree learning.
This section introduces the naive Bayes classifier, a method that is
also widely applied to the practical problem of classifying natural
language text documents.
The naive Bayes classifier applies to learning tasks where each
instance x is described by a conjunction of attribute values and where
the target function f(x) can take on any value from some finite set
V . A set of training examples of the target function is provided, and
a new instance is presented, described by the tuple of attribute values
< a1, a2...an >. The learner is asked to predict the target value, or
classification, for this new instance.
23. Bayesian Learning (cont.)
The Bayesian approach to classifying the new instance is to assign the
most probable target value, vMAP, given the attribute values
< a1, a2...an > that describe the instance.
\[ v_{MAP} = \arg\max_{v_j \in V} P(v_j | a_1, a_2 \ldots a_n) \]
We can use Bayes theorem to rewrite this expression as
\[
\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} \frac{P(a_1, a_2 \ldots a_n | v_j)\,P(v_j)}{P(a_1, a_2 \ldots a_n)} && (2) \\
        &= \arg\max_{v_j \in V} P(a_1, a_2 \ldots a_n | v_j)\,P(v_j) && (3)
\end{aligned}
\]
Now we could attempt to estimate the two terms in Equation (3)
based on the training data.
24. Bayesian Learning (cont.)
It is easy to estimate each of the P(vj) simply by counting the
frequency with which each target value vj occurs in the training data.
However, estimating the different P(a1, a2...an|vj) terms in this
fashion is not feasible unless we have a very large set of training data.
The problem is that the number of these terms is equal to the number
of possible instances times the number of possible target values.
Therefore, we need to see every instance in the instance space many
times in order to obtain reliable estimates.
The naive Bayes classifier is based on the simplifying assumption that
the attribute values are conditionally independent given the target
value.
In other words, the assumption is that given the target value of the
instance, the probability of observing the conjunction a1, a2...an is
just the product of the probabilities for the individual attributes:
\[ P(a_1, a_2 \ldots a_n | v_j) = \prod_i P(a_i | v_j) \]
25. Bayesian Learning (cont.)
Substituting this into Equation (3), we have the approach used by the naive Bayes classifier:
\[ v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i | v_j) \tag{4} \]
where vNB denotes the target value output by the naive Bayes
classifier.
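A minimal sketch of the naive Bayes rule follows: it estimates P(vj) and each P(ai|vj) by counting frequencies in the training data and then picks the value with the largest product. The function names and the tiny two-attribute data set are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter


def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(label for _, label in examples)
    # attr_counts[(i, value, label)] = number of examples with a_i == value
    # among the examples whose target value is label.
    attr_counts = Counter()
    for attributes, label in examples:
        for i, value in enumerate(attributes):
            attr_counts[(i, value, label)] += 1
    return class_counts, attr_counts


def classify(attributes, class_counts, attr_counts):
    """Return the target value maximizing P(v_j) * prod_i P(a_i | v_j)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total                                  # P(v_j)
        for i, value in enumerate(attributes):
            score *= attr_counts[(i, value, label)] / count    # P(a_i | v_j)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data with attributes (Sky, Wind).
examples = [
    (("Sunny", "Strong"), "Yes"),
    (("Sunny", "Weak"), "Yes"),
    (("Sunny", "Strong"), "Yes"),
    (("Rainy", "Strong"), "No"),
    (("Rainy", "Weak"), "No"),
    (("Sunny", "Weak"), "No"),
]
class_counts, attr_counts = train_naive_bayes(examples)
label_sunny_strong = classify(("Sunny", "Strong"), class_counts, attr_counts)
label_rainy_weak = classify(("Rainy", "Weak"), class_counts, attr_counts)
```

Note that an attribute value never seen with a given target value makes the whole product zero; practical implementations usually smooth the counts to avoid this.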
26. Bayesian Learning (cont.)
BAYESIAN BELIEF NETWORKS
Abbreviation: BBN (Bayesian Belief Network)
Synonyms: Bayes(ian) network, Bayes(ian) model, Belief network,
Decision network, or probabilistic directed acyclic graphical model.
A BBN is a probabilistic graphical model that represents a set of
variables and their conditional dependencies via a Directed Acyclic
Graph (DAG).
27. Bayesian Learning (cont.)
BBNs enable us to model and reason about uncertainty. BBNs
accommodate both subjective probabilities and probabilities based on
objective data.
The most important use of BBNs is in revising probabilities in the
light of actual observations of events.
Nodes represent variables in the Bayesian sense: observable
quantities, hidden variables or hypotheses. Edges represent
conditional dependencies.
Each node is associated with a probability function that takes, as
input, a particular set of values for the node’s parent
variables, and outputs the probability of the values of the variable
represented by the node.
Prior Probabilities: e.g. P(RAIN)
Conditional Probabilities: e.g. P(SPRINKLER | RAIN)
28. Bayesian Learning (cont.)
Joint Probability Function:
P(GRASS WET, SPRINKLER, RAIN) = P(GRASS WET | RAIN, SPRINKLER) * P(SPRINKLER | RAIN) * P(RAIN)
29. Bayesian Learning (cont.)
Typically the probability functions are described in table form.
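The factorized joint probability and the table form of the probability functions can be sketched as below for the RAIN, SPRINKLER, GRASS WET network. All numbers in the tables are illustrative assumptions (the original tables are not reproduced in the text), as are the variable and function names.

```python
from itertools import product

# Assumed conditional probability tables, for illustration only.
P_RAIN = {True: 0.2, False: 0.8}
# P_SPRINKLER_GIVEN_RAIN[rain][sprinkler]
P_SPRINKLER_GIVEN_RAIN = {
    True: {True: 0.01, False: 0.99},
    False: {True: 0.4, False: 0.6},
}
# P(GRASS WET = True | SPRINKLER, RAIN), keyed by (sprinkler, rain)
P_WET_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(wet, sprinkler, rain):
    """P(wet, sprinkler, rain) = P(wet | sprinkler, rain) P(sprinkler | rain) P(rain)."""
    p_wet = P_WET_GIVEN[(sprinkler, rain)]
    if not wet:
        p_wet = 1.0 - p_wet
    return p_wet * P_SPRINKLER_GIVEN_RAIN[rain][sprinkler] * P_RAIN[rain]


# The joint distribution must sum to 1 over all eight assignments.
total = sum(joint(w, s, r) for w, s, r in product([True, False], repeat=3))

# Revising a probability in the light of an observation:
# P(RAIN | GRASS WET) = P(RAIN, GRASS WET) / P(GRASS WET).
p_wet = sum(joint(True, s, r) for s, r in product([True, False], repeat=2))
p_rain_given_wet = sum(joint(True, s, True) for s in [True, False]) / p_wet
```

With these assumed tables, observing wet grass raises the probability of rain from the prior 0.2 to roughly 0.36, which is the kind of revision the slides describe as the most important use of BBNs.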
30. Bayesian Learning (cont.)
EM Algorithm
The EM algorithm was explained and given its name in a classic
1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.
• EM is typically used to compute maximum likelihood estimates
given incomplete samples.
• The EM algorithm estimates the parameters of a model
iteratively.
– Starting from some initial guess, each iteration consists of
an E step (Expectation step)
an M step (Maximization step)
31. Bayesian Learning (cont.)
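General Idea
[Figure: the EM loop. Starting from an initial guess of the unknown parameters (probabilities), the E step combines that guess with the observed structure (words, ice cream) to produce a guess of the unknown hidden structure (tags, parses, weather); the M step uses that guess to update the parameters. Source: 600.465 - Intro to NLP - J. Eisner.]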
32. Bayesian Learning (cont.)
MLE (Maximum likelihood estimation): a general statement
Consider a sample (X1, ..., Xn) which is drawn from a probability distribution P(X|A), where A are
parameters. If the Xs are independent with probability density function P(Xi|A), the joint
probability of the whole set is
\[ P(X_1 \ldots X_n | A) = \prod_{i=1}^{n} P(X_i | A) \]
This may be maximised with respect to A to give the maximum likelihood estimates.
33. Bayesian Learning (cont.)
MLE (Maximum likelihood estimation)
• Given
– A sample X={X1, …, Xn}
– A vector of parameters θ
• We define
– Likelihood of the data: P(X | θ)
– Log-likelihood of the data: L(θ)=log P(X|θ)
• Given X, find
\[ \theta_{ML} = \arg\max_{\theta \in \Omega} L(\theta) \]
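As a small illustration of maximizing the log-likelihood L(θ), the sketch below estimates the heads probability of a coin from independent flips, once by a brute-force search over a grid of θ values and once with the known closed form (the fraction of heads). The sample, the grid resolution and the names are assumptions for the example.

```python
import math


def log_likelihood(theta, sample):
    """L(theta) = log P(X | theta) for independent Bernoulli observations."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in sample)


# Hypothetical sample of ten coin flips (1 = heads), seven of them heads.
sample = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# theta_ML = argmax over a grid of candidate values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, sample))

# For a Bernoulli model the maximizer is known in closed form.
closed_form = sum(sample) / len(sample)
```

The grid search and the closed form agree here; the grid is only meant to make the arg max visible and would not be used in practice.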
34. Bayesian Learning (cont.)
Basic setting in EM
• X is a set of data points: observed data
• θ is a parameter vector.
• EM is a method to find θML where
\[ \theta_{ML} = \arg\max_{\theta \in \Omega} L(\theta) = \arg\max_{\theta \in \Omega} \log P(X|\theta) \]
• Calculating P(X | θ) directly is hard.
• Calculating P(X,Y|θ) is much simpler, where Y is “hidden” data (or
“missing” data).
35. Bayesian Learning (cont.)
The basic EM strategy
• Z = (X, Y)
– Z: complete data (“augmented data”)
– X: observed data (“incomplete” data)
– Y: hidden data (“missing” data)
36. Bayesian Learning (cont.)
The “missing” data Y
• Y need not necessarily be missing in the practical sense of the
word.
• It may just be a conceptually convenient technical device to
simplify the calculation of P(x |θ).
• There could be many possible Ys.
37. Bayesian Learning (cont.)
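The EM algorithm
• Start with initial estimate, θ0
• Repeat until convergence
– E-step: calculate
\[ Q(\theta, \theta^{t}) = \sum_{i=1}^{n} \sum_{y} P(y | x_i, \theta^{t}) \log P(x_i, y | \theta) \]
– M-step: find
\[ \theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{t}) \]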
38. Bayesian Learning (cont.)
The Q-function
The Q-function is the expected value of the complete data log-likelihood log P(X,Y|θ)
with respect to Y given X and θt,
where
– Y is a random vector.
– X=(x1, x2, …, xn) is a constant (vector).
– θt is the current parameter estimate and is a constant (vector).
– θ is the ordinary variable (vector) that we wish to adjust.
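The E and M steps can be sketched on a small two-coin problem: each trial reports the number of heads in a fixed number of flips of one of two coins, and which coin was used is the hidden data Y. The setup, the fixed equal mixing weights, the starting values and all names are assumptions chosen for illustration; this is one instance of EM, not the general algorithm.

```python
import math


def em_two_coins(head_counts, flips, theta_a, theta_b, iterations=20):
    """EM for a two-coin mixture with equal, fixed mixing weights.

    Returns the final estimates and the observed-data log-likelihood
    (up to an additive constant) evaluated at the start of each iteration.
    """
    history = []
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        log_lik = 0.0
        for h in head_counts:
            t = flips - h
            # Likelihood of the trial under each coin. The binomial coefficient
            # is identical for both coins, so it is omitted.
            like_a = theta_a ** h * (1.0 - theta_a) ** t
            like_b = theta_b ** h * (1.0 - theta_b) ** t
            log_lik += math.log(0.5 * like_a + 0.5 * like_b)
            # E step: P(y | x_i, theta_t), the responsibility of each coin.
            resp_a = like_a / (like_a + like_b)
            resp_b = 1.0 - resp_a
            heads_a += resp_a * h
            tails_a += resp_a * t
            heads_b += resp_b * h
            tails_b += resp_b * t
        history.append(log_lik)
        # M step: the arg max of Q is the responsibility-weighted head frequency.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b, history


# Hypothetical observations: heads out of 10 flips in five trials.
head_counts = [5, 9, 8, 4, 7]
theta_a, theta_b, history = em_two_coins(head_counts, flips=10, theta_a=0.6, theta_b=0.5)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property that makes EM a sensible way to approach θML when P(X|θ) is hard to maximize directly. The result depends on the initial guess, and EM may stop at a local maximum.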
39. Bayesian Learning (cont.)
Expectation Maximization EM
When to use
data is only partially observable
unsupervised clustering: target value unobservable
supervised learning: some instance attributes unobservable
Applications
training Bayesian Belief Networks
unsupervised clustering
learning hidden Markov models
40. SUPPORT VECTOR MACHINE
History of SVM
SVM is related to statistical learning theory.
SVM was first introduced in 1992.
SVM became popular because of its success in handwritten digit
recognition: a 1.1% test error rate for SVM, the same as the
error rate of a carefully constructed neural network.
SVM is now regarded as an important example of “kernel methods”,
one of the key areas in machine learning.
41. SUPPORT VECTOR MACHINE (cont.)
Binary Classification
Given training data (xi, yi) for i = 1 . . . N, with
xi ∈ Rd and yi ∈ {−1, 1}, learn a classifier f(x)
such that
\[ f(x_i) \begin{cases} \geq 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases} \]
i.e. yi f(xi) > 0 for a correct classification.
42. SUPPORT VECTOR MACHINE (cont.)
Linear separability
[Figure: two panels, one showing data that is linearly separable and one showing data that is not linearly separable.]
43. SUPPORT VECTOR MACHINE (cont.)
Linear classifiers
A linear classifier has the form
\[ f(x) = w^{\top} x + b \]
• in 2D the discriminant is a line
• w is the normal to the line, and b the bias
• w is known as the weight vector
[Figure: in the (X1, X2) plane, the line f(x) = 0 separates the region f(x) > 0 from the region f(x) < 0.]
44. SUPPORT VECTOR MACHINE (cont.)
Linear classifiers
A linear classifier has the form
\[ f(x) = w^{\top} x + b \]
• in 3D the discriminant is a plane, and in nD it is a hyperplane (the set of points where f(x) = 0)
For a K-NN classifier it was necessary to ‘carry’ the training data
For a linear classifier, the training data is used to learn w and then discarded
Only w is needed for classifying new data
45. SUPPORT VECTOR MACHINE (cont.)
The Perceptron Classifier
Given linearly separable data xi labelled into two categories yi ∈ {-1, 1},
find a weight vector w such that the discriminant function
\[ f(x_i) = w^{\top} x_i + b \]
separates the categories for i = 1, .., N
• how can we find this separating hyperplane?
The Perceptron Algorithm
Write the classifier as
\[ f(x_i) = \tilde{w}^{\top} \tilde{x}_i + w_0 = w^{\top} x_i \]
where w = (w̃, w0), xi = (x̃i, 1)
• Initialize w = 0
• Cycle through the data points { xi, yi }
• if xi is misclassified then
\[ w \leftarrow w + \alpha \, y_i \, x_i \]
• Until all the data is correctly classified
For a misclassified point sign(f(xi)) = −yi, so this update is the same as w ← w − α sign(f(xi)) xi.
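The perceptron algorithm can be sketched as follows. The sketch uses the augmented vectors from the slide (a constant 1 appended to each point so that the bias becomes part of w) and corrects each misclassified point in the direction of its label yi. The function names, the step size, the epoch limit and the data are assumptions made for the example.

```python
def train_perceptron(points, labels, alpha=1.0, max_epochs=1000):
    """Return the augmented weight vector w = (w~, w0) of a separating hyperplane."""
    dim = len(points[0])
    w = [0.0] * (dim + 1)                      # initialize w = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):       # cycle through the data points
            x_aug = list(x) + [1.0]            # x_i = (x~_i, 1)
            f = sum(wi * xi for wi, xi in zip(w, x_aug))
            if y * f <= 0:                     # misclassified or on the boundary
                w = [wi + alpha * y * xi for wi, xi in zip(w, x_aug)]
                mistakes += 1
        if mistakes == 0:                      # all data correctly classified
            return w
    raise RuntimeError("no separating hyperplane found; data may not be linearly separable")


def decision_value(w, x):
    """f(x) = w^T x with the augmented representation."""
    return sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))


# Hypothetical linearly separable data in 2D.
points = [(2.0, 2.0), (3.0, 3.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0), (0.0, -1.0)]
labels = [1, 1, 1, -1, -1, -1]
w = train_perceptron(points, labels)
```

The loop only terminates on its own when the data is linearly separable, which is why the sketch caps the number of epochs; the hyperplane it returns is just one of many that separate the data, not the one with the largest margin.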
46. SUPPORT VECTOR MACHINE (cont.)
For example in 2D
[Figure: the weight vector w in the (X1, X2) plane before and after an update on a misclassified point xi.]
NB after convergence
\[ w = \sum_{i}^{N} \alpha_i x_i \]
47. SUPPORT VECTOR MACHINE (cont.)
• if the data is linearly separable, then the algorithm will converge
• convergence can be slow …
• separating line close to training data
• we would prefer a larger margin for generalization
[Figure: Perceptron example.]
48. SUPPORT VECTOR MACHINE (cont.)
What is the best w?
• maximum margin solution: most stable under perturbations of the inputs
49. SUPPORT VECTOR MACHINE (cont.)
Tennis example
[Figure: scatter plot with axes Humidity and Temperature; the two marker types denote "play tennis" and "do not play tennis".]
50. SUPPORT VECTOR MACHINE (cont.)
Linear Support Vector Machines
Data: <xi,yi>, i=1,..,l, with xi ∈ Rd and yi ∈ {-1,+1}
[Figure: data points in the (x1, x2) plane, marked by class +1 or -1.]
51. SUPPORT VECTOR MACHINE (cont.)
Linear SVM 2
Data: <xi,yi>, i=1,..,l, with xi ∈ Rd and yi ∈ {-1,+1}
All hyperplanes in Rd are parameterized by a vector w and a constant b,
and can be expressed as w•x+b=0 (remember the equation for a hyperplane
from algebra!).
Our aim is to find a hyperplane such that the classifier f(x)=sign(w•x+b)
correctly classifies our data.
[Figure: the two classes, marked +1 and -1, with the decision function f(x).]
52. SUPPORT VECTOR MACHINE (cont.)
Definitions
Define the hyperplane H such that:
xi•w+b ≥ +1 when yi = +1
xi•w+b ≤ -1 when yi = -1
H1 and H2 are the planes:
H1: xi•w+b = +1
H2: xi•w+b = -1
The points on the planes H1 and H2 are the Support Vectors.
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d-.
[Figure: the hyperplane H between the parallel planes H1 and H2, with the distances d+ and d- marked.]
53. SUPPORT VECTOR MACHINE (cont.)
Maximizing the margin
We want a classifier with as big a margin as possible.
Recall that the distance from a point (x0, y0) to the line
Ax+By+c = 0 is |A x0 + B y0 + c| / sqrt(A^2 + B^2).
The distance between H and H1 is:
|w•x+b|/||w|| = 1/||w||
The distance between H1 and H2 is: 2/||w||
In order to maximize the margin, we need to minimize ||w||, with the
condition that there are no data points between H1 and H2:
xi•w+b ≥ +1 when yi = +1
xi•w+b ≤ -1 when yi = -1
These can be combined into yi(xi•w+b) ≥ 1.
54. SUPPORT VECTOR MACHINE (cont.)
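Constrained Optimization Problem
Minimize ||w|| (equivalently ½||w||²) subject to yi(xi•w+b) ≥ 1 for all i
Lagrangian method: maximize inf_w L(w, b, α), where
\[ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] \]
At the extremum, the partial derivative of L with respect to both w and b must be 0. Taking the derivatives, setting
them to 0, substituting back into L, and simplifying yields:
\[ \text{Maximize} \quad \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \]
\[ \text{subject to} \quad \sum_i \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \geq 0 \]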
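The step from the Lagrangian to the dual problem can be written out as below. This is a sketch of the standard derivation, assuming the squared-norm form of the objective, ½||w||², which has the same minimizer as ||w||.

```latex
% Lagrangian of the margin problem, with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right]

% Setting the partial derivatives to zero
\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0
  \;\Longrightarrow\; w = \sum_i \alpha_i y_i x_i
\qquad
\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0

% Substituting w back into L; the term b \sum_i \alpha_i y_i vanishes
L = \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
    - \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
    + \sum_i \alpha_i
  = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
```

The data enters the final expression only through the dot products xi•xj, which is what later makes it possible to replace them with a kernel.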
55. SUPPORT VECTOR MACHINE (cont.)
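Quadratic Programming
• Why is this reformulation a good thing?
• The problem
\[ \text{Maximize} \quad \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \geq 0 \]
is an instance of what is called a positive semi-definite
programming problem
• For a fixed real-number accuracy, it can be solved in
O(n log n) time = O(|D|^2 log |D|^2)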
56. SUPPORT VECTOR MACHINE (cont.)
Problems with linear SVM
[Figure: data points of the two classes, marked -1 and +1.]
What if the decision function is not linear?
57. SUPPORT VECTOR MACHINE (cont.)
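Kernel Trick
Data points are linearly separable in the space
\[ (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2) \]
We want to maximize
\[ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \left( F(x_i) \cdot F(x_j) \right) \]
Define
\[ K(x_i, x_j) = F(x_i) \cdot F(x_j) \]
Cool thing: K is often easy to compute directly! Here,
\[ K(x_i, x_j) = (x_i \cdot x_j)^2 \]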
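The claim that K can be computed directly, without ever building the mapped vectors, can be checked numerically for the mapping above. The function names and the two sample points are assumptions for this sketch.

```python
import math


def feature_map(x):
    """F(x) = (x1^2, x2^2, sqrt(2) * x1 * x2) for a 2D point x."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel(x, z):
    """K(x, z) = (x . z)^2, computed in the original 2D space."""
    return dot(x, z) ** 2


# Two hypothetical points.
x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(feature_map(x), feature_map(z))   # dot product in the mapped space
implicit = kernel(x, z)                          # same value, no mapping needed
```

Both routes give the same number, but the kernel route needs only one 2D dot product and one squaring, which is the saving the kernel trick relies on when the mapped space is large.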
58. SUPPORT VECTOR MACHINE (cont.)
Other Kernels
The polynomial kernel
\[ K(x_i, x_j) = (x_i \cdot x_j + 1)^p \]
where p is a tunable parameter.
Evaluating K only requires one addition and one exponentiation
more than the original dot product.
Gaussian kernels (also called radial basis functions)
\[ K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right) \]
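Both kernels are short enough to write down directly. The sketch below assumes the usual form of the Gaussian kernel, with a negative exponent and a width parameter σ; the function names, default parameter values and sample points are illustrative.

```python
import math


def polynomial_kernel(x, z, p=2):
    """K(x, z) = (x . z + 1)^p, where p is a tunable parameter."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** p


def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), a radial basis function."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical points used to exercise the two kernels.
k_poly = polynomial_kernel((1.0, 2.0), (3.0, -1.0), p=2)
k_same = gaussian_kernel((1.0, 2.0), (1.0, 2.0))
k_far = gaussian_kernel((0.0, 0.0), (1.0, 1.0), sigma=1.0)
```

The Gaussian kernel equals 1 for identical points and decays toward 0 as the points move apart, with σ controlling how quickly; a small σ gives a very flexible decision boundary and a higher risk of overfitting.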
59. SUPPORT VECTOR MACHINE (cont.)
Overtraining/overfitting
A well known problem with machine learning methods is overtraining.
This means that we have learned the training data very well, but
we cannot classify unseen examples correctly.
An example: a botanist who knows his trees too well. Every time he sees a new tree,
he claims it is not a tree.
[Figure: data points of the two classes, marked -1 and +1.]
60. SUPPORT VECTOR MACHINE (cont.)
Overtraining/overfitting 2
It can be shown that the portion, n, of unseen data that will be
misclassified is bounded by:
n ≤ number of support vectors / number of training examples
This is a measure of the risk of overtraining with SVM (there are also other
measures).
Ockham's razor principle: simpler systems are better than more complex ones.
In the SVM case: fewer support vectors mean a simpler representation of the
hyperplane.
Example: Understanding a certain cancer is easier if it can be described by one gene
than if we have to describe it with 5000.
61. SUPPORT VECTOR MACHINE (cont.)
A practical example: protein localization
• Proteins are synthesized in the cytosol.
• They are transported into different subcellular
locations where they carry out their
functions.
• Aim: To predict in what location a
certain protein will end up.
62. Conclusion
Regression is a statistical technique used to establish the relationship
between one dependent variable and one or more independent
variables.
Bayesian learning uses Bayes theorem to determine the conditional
probability of a hypothesis given some evidence or observations.
SVM is a useful alternative to neural networks.
Two key concepts of SVM: maximizing the margin and the kernel trick.
Many SVM implementations are available on the web for you to try
on your data set!
63. References
[1] Thomas M. Mitchell.
Machine Learning.
McGraw-Hill, Inc., USA, 1 edition, 1997.
[2] Christopher M. Bishop.
Pattern Recognition and Machine Learning.
Springer, 2006.
[3] Stephen Marsland.
Machine Learning: An Algorithmic Perspective, Second Edition.
2nd edition, 2014.
[4] Ethem Alpaydin.
Introduction to Machine Learning.
Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 3 edition, 2014.
[5] B E Boser, I M Guyon, and V N Vapnik.
A training algorithm for optimal margin classifiers.
In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, volume 5, pages 144–152,
Springer-Verlag, Berlin Heidelberg, 1992.
[6] V. Vapnik.
The nature of statistical learning theory.
In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Springer, 1999.
64. Thank You!