Machine Learning Techniques
Bayesian Learning
Support Vector Machine
Dr. Radhey Shyam
Dept. of Computer Science & Engg., BIET Lucknow
Following slides have been prepared by Radhey Shyam, with grateful acknowledgement of others who made their course
contents freely available. Feel free to reuse these slides for your own academic purposes. Please send feedback at
Dr. Radhey Shyam (BIET Lucknow-AKTU) Machine Learning Techniques September 29, 2020 1 / 64
1 Regression
Simple Linear Regression
Logistic Regression
2 Bayesian Learning
Bayes Theorem
Concept Learning
Bayes Optimal Classifier
EM Algorithm
3 Support Vector Machine
4 Conclusion
5 References
Linear Regression— In statistics, linear regression is a linear approach to
modeling the relationship between a scalar response (or dependent
variable) and one or more explanatory variables (or independent variables).
The case of one explanatory variable is called simple linear regression.
Linear regression is used to predict the continuous dependent variable
using a given set of independent variables.
Linear regression is used for solving regression problems.
In linear regression, the values of continuous variables are predicted.
Linear regression tries to find the best-fit line, through which the
output can be easily predicted.
The least squares estimation method (footnote 1) is used to fit the line
and to assess the accuracy (footnote 2) of the fit.
The output for Linear Regression must be a continuous value, such as
price, age, etc.
Linear regression requires that the relationship between the
dependent variable and the independent variables be linear.
In linear regression, there may be collinearity (footnote 3) between the
independent variables.
Some Regression examples:
Regression analysis is used in stats to find trends in data. For
example, you might guess that there is a connection between how
much you eat and how much you weigh; regression analysis can help
you quantify that.
Regression analysis will provide you with an equation for a graph so
that you can make predictions about your data. For example, if
you’ve been putting on weight over the last few years, it can predict
how much you’ll weigh in ten years time if you continue to put on
weight at the same rate.
Simple Linear Regression: regression with a single explanatory variable.
It establishes the relationship between two variables using a straight
line. If two or more explanatory variables have a linear relationship
with the dependent variable, the regression is called multiple linear
regression.
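The straight-line fit described above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own method: the function name `fit_line` and the weight data are made up for the example, and the closed-form least squares formulas for slope and intercept are assumed.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: year index versus body weight in kg.
years = [0, 1, 2, 3, 4]
weights = [70.0, 71.0, 72.0, 73.0, 74.0]

slope, intercept = fit_line(years, weights)
# Extrapolate ten years past the last observation (year 4).
predicted = slope * 14 + intercept
print(slope, intercept, predicted)
```

The prediction simply extends the fitted line, which mirrors the "same rate" assumption in the weight example.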
Logistic Regression: used to solve classification problems where,
given an element, you have to assign it to one of N categories. Typical
examples are classifying a mail as spam or not spam, or finding the
category to which a given vehicle belongs (car, truck, van, etc.).
That is, the output is one of a finite set of discrete values.
Logistic Regression is used to predict the categorical dependent
variable using a given set of independent variables.
Logistic regression is used for solving Classification problems.
In logistic Regression, we predict the values of categorical variables.
In Logistic Regression, we find the S-curve by which we can classify
the samples.
The maximum likelihood estimation method is used to estimate the model parameters.
The output of Logistic Regression must be a Categorical value such
as 0 or 1, Yes or No, etc.
In logistic regression, a linear relationship between the dependent
and independent variables is not required.
In logistic regression, there should not be collinearity between the
independent variables.
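The S-curve and the maximum likelihood fit mentioned above can be illustrated with a small sketch. Everything here is an assumption made for the example: the one-feature model, the function names, the learning rate, and the data. It fits the parameters by plain gradient ascent on the average log-likelihood rather than by any particular library routine.

```python
import math


def sigmoid(z):
    """The S-curve that maps a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Gradient ascent on the average log-likelihood of a one-feature model."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += lr * sum(e * x for e, x in zip(errors, xs)) / n
        b += lr * sum(errors) / n
    return w, b


def predict(w, b, x):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical one-dimensional data with overlapping classes.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(predict(w, b, 0.5), predict(w, b, 4.5))
```

The classes overlap on purpose: with perfectly separable data the maximum likelihood weights grow without bound, so a sketch like this would need a stopping rule or regularization.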
Footnote 1: The least squares method is a statistical procedure to find the best fit for a set of
data points by minimizing the sum of the squared offsets of points from the plotted curve. Least
squares regression is used to predict the behavior of dependent variables.
Footnote 2: Accuracy is how close a measured value is to the actual value. Precision is how
close the measured values are to each other.
Footnote 3: Collinearity is a condition in which some of the independent variables are highly
correlated.
Bayesian Learning
Bayesian Decision Theory came long before Version Spaces, Decision
Tree Learning and Neural Networks. It was studied in the field of
Statistical Theory and, more specifically, in the field of Pattern
Recognition.
Bayesian Decision Theory is at the basis of important learning
schemes such as the Naïve Bayes Classifier, Learning Bayesian Belief
Networks and the EM Algorithm.
Bayesian Decision Theory is also useful as it provides a framework
within which many non-Bayesian classifiers can be studied.
Bayes Theorem
Goal — To determine the most probable hypothesis, given the data
D plus any initial knowledge about the prior probabilities of the
various hypotheses in H.
Prior probability of h, P(h) — it reflects any background
knowledge we have about the chance that h is a correct hypothesis
(before having observed the data).
Prior probability of D, P(D) — it reflects the probability that
training data D will be observed given no knowledge about which
hypothesis h holds.
Conditional Probability of observation D, P(D|h) — it denotes
the probability of observing data D given some world in which
hypothesis h holds.
Posterior probability of h, P(h|D) — it represents the probability
that h holds given the observed training data D. It reflects our
confidence that h holds after we have seen the training data D and it
is the quantity that Machine Learning researchers are interested in.
Bayes Theorem allows us to compute P(h|D) —
P(h|D) = P(D|h)P(h)/P(D)
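Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood
Goal: To find the most probable hypothesis h from a set of
candidate hypotheses H given the observed data D. This is the MAP hypothesis:
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h) P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h) P(h)$$
The last step drops P(D) because it does not depend on h.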
If every hypothesis in H is equally probable a priori, we only need to
consider the likelihood of the data D given h, P(D|h). Then, hMAP
becomes the Maximum Likelihood hypothesis,
$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$
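A small numeric sketch may help show how the MAP and maximum likelihood hypotheses can differ. The two hypotheses and all probabilities below are illustrative assumptions (a medical-test style example), not values from the slides.

```python
# Hypothetical priors P(h) and likelihoods P(D | h), where D is "test is positive".
priors = {"cancer": 0.008, "no_cancer": 0.992}
likelihoods = {"cancer": 0.98, "no_cancer": 0.03}

# P(D) by the law of total probability.
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Bayes theorem: P(h | D) = P(D | h) P(h) / P(D).
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}

# MAP maximizes P(D | h) P(h); ML maximizes P(D | h) alone.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
h_ml = max(priors, key=lambda h: likelihoods[h])

print(posteriors, h_map, h_ml)
```

With these numbers the two criteria disagree: the strong prior against "cancer" outweighs the higher likelihood, which is exactly the case where ignoring P(h) changes the answer.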
Concept Learning
Inducing general functions from specific training examples is a main
issue of machine learning.
Concept Learning: Acquiring the definition of a general category from
given sample positive and negative training examples of the category.
Concept Learning can be seen as a problem of searching through a
predefined space of potential hypotheses for the hypothesis that best
fits the training examples.
The hypothesis space has a general-to-specific ordering of hypotheses,
and the search can be efficiently organized by taking advantage of a
naturally occurring structure over the hypothesis space.
A Formal Definition for Concept Learning:
Inferring a boolean-valued function from training examples of
its input and output.
An example for concept-learning is the learning of bird-concept from
the given examples of birds (positive examples) and non-birds
(negative examples).
We are trying to learn the definition of a concept from given examples.
The training data is a set of example days, each described by six attributes.
The task is to learn to predict the value of EnjoySport for an arbitrary
day, based on the values of its attributes.
EnjoySport – Hypothesis Representation
Each hypothesis consists of a conjunction of constraints on the
instance attributes.
Each hypothesis will be a vector of six constraints, specifying the
values of the six attributes
– (Sky, AirTemp, Humidity, Wind, Water, and Forecast).
Each attribute will be:
? - indicating any value is acceptable for the attribute (don’t care)
singlevalue – specifying a single required value (ex. Warm)
φ - indicating no value is acceptable for the attribute (no value)
Hypothesis Representation
A hypothesis:
Sky AirTemp Humidity Wind Water Forecast
< Sunny, ?, ?, Strong, ?, Same >
The most general hypothesis – that every day is a positive example
<?, ?, ?, ?, ?, ? >
The most specific hypothesis – that no day is a positive example
< φ, φ, φ, φ, φ, φ >
EnjoySport concept learning task requires learning the sets of days
for which EnjoySport = yes, describing this set by a conjunction of
constraints over the instance attributes.
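The hypothesis representation above maps directly onto a tuple of constraints. The sketch below is a minimal illustration; the function name `matches` and the sample days are assumptions made for the example.

```python
ANY = "?"    # any value is acceptable for the attribute
NONE = "φ"   # no value is acceptable for the attribute


def matches(hypothesis, instance):
    """Return True when the instance satisfies every constraint in the hypothesis."""
    # NONE never equals a real attribute value, so it rejects every instance.
    return all(c == ANY or c == v for c, v in zip(hypothesis, instance))


# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast.
h = ("Sunny", ANY, ANY, "Strong", ANY, "Same")
most_general = (ANY,) * 6
most_specific = (NONE,) * 6

# Hypothetical example days.
sunny_day = ("Sunny", "Warm", "High", "Strong", "Warm", "Same")
rainy_day = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")

print(matches(h, sunny_day), matches(h, rainy_day))
```

A hypothesis classifies a day as a positive example exactly when `matches` returns True, so the most general hypothesis accepts every day and the most specific accepts none.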
EnjoySport Concept Learning Task
Instances X : set of all possible days, each described by the attributes
Sky – (values: Sunny, Cloudy, Rainy)
AirTemp – (values: Warm, Cold)
Humidity – (values: Normal, High)
Wind – (values: Strong, Weak)
Water – (values: Warm, Cold)
Forecast – (values: Same, Change)
TargetConcept(Function) c : EnjoySport : X → {0, 1}
Hypotheses H : Each hypothesis is described by a conjunction of
constraints on the attributes.
TrainingExamples D: positive and negative examples of the target
concept.
Determine: A hypothesis h in H such that h(x) = c(x) for all x in D.
The Inductive Learning Hypothesis
Although the learning task is to determine a hypothesis h identical to
the target concept c over the entire set of instances X, the only
information available about c is its value over the training examples.
Inductive learning algorithms can at best guarantee that the
output hypothesis fits the target concept over the training data.
Lacking any further information, our assumption is that the best
hypothesis regarding unseen instances is the hypothesis that best fits
the observed training data. This is the fundamental assumption of
inductive learning.
The Inductive Learning Hypothesis - Any hypothesis found to
approximate the target function well over a sufficiently large set of
training examples will also approximate the target function well over
other unobserved examples.
Concept Learning as Search
Concept learning can be viewed as the task of searching through a
large space of hypotheses implicitly defined by the hypothesis
representation.
The goal of this search is to find the hypothesis that best fits the
training examples.
By selecting a hypothesis representation, the designer of the learning
algorithm implicitly defines the space of all hypotheses that the
program can ever represent and therefore can ever learn.
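To make the size of this implicitly defined space concrete, the sketch below enumerates it for the EnjoySport attributes listed earlier. It is an illustration under the stated representation (each constraint is a single value, "?" or "φ"); the variable names are assumptions made for the example.

```python
from itertools import product

ATTRIBUTES = {
    "Sky": ["Sunny", "Cloudy", "Rainy"],
    "AirTemp": ["Warm", "Cold"],
    "Humidity": ["Normal", "High"],
    "Wind": ["Strong", "Weak"],
    "Water": ["Warm", "Cold"],
    "Forecast": ["Same", "Change"],
}
ANY, NONE = "?", "φ"


def matches(hypothesis, instance):
    """True when the instance satisfies every constraint of the hypothesis."""
    return all(c == ANY or c == v for c, v in zip(hypothesis, instance))


# Every possible day, and every syntactically distinct hypothesis.
instances = list(product(*ATTRIBUTES.values()))
hypotheses = list(product(*[vals + [ANY, NONE] for vals in ATTRIBUTES.values()]))

# Two hypotheses are semantically equal when they accept the same set of days.
distinct_concepts = {
    frozenset(x for x in instances if matches(h, x)) for h in hypotheses
}

n_instances = len(instances)
n_syntactic = len(hypotheses)
n_semantic = len(distinct_concepts)
print(n_instances, n_syntactic, n_semantic)
```

Every hypothesis containing "φ" accepts no day at all, so all of them collapse into a single concept; that is why the semantic count is much smaller than the syntactic one.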
Bayes Optimal Classifier
We have considered the question so far “what is the most probable
hypothesis given the training data?”
In fact, the question that is often of most significance is the closely
related question “what is the most probable classification of the new
instance given the training data?”
Although it may seem that this second question can be answered by
simply applying the MAP hypothesis to the new instance, in fact it is
possible to do better.
To develop some intuitions consider a hypothesis space containing
three hypotheses, h1, h2, and h3.
Suppose that the posterior probabilities of these hypotheses given the
training data are .4, .3, and .3 respectively.
Thus, h1 is the MAP hypothesis.
Suppose a new instance x is encountered, which is classified positive
by h1 , but negative by h2 and h3.
Taking all hypotheses into account, the probability that x is positive
is .4 (the probability associated with h1), and the probability that it is
negative is therefore .6. The most probable classification (negative) in
this case is different from the classification generated by the MAP
hypothesis.
In general, the most probable classification of the new instance is
obtained by combining the predictions of all hypotheses, weighted by
their posterior probabilities.
If the possible classification of the new example can take on any value
vj from some set V, then the probability P(vj|D) that the correct
classification for the new instance is vj is
$$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i) P(h_i \mid D)$$
The optimal classification of the new instance is the value vj for
which P(vj|D) is maximum:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i) P(h_i \mid D) \quad (1)$$
Any system that classifies new instances according to Equation (1) is
called a Bayes optimal classifier, or Bayes optimal learner. No
other classification method using the same hypothesis space and same
prior knowledge can outperform this method on average.
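The three-hypothesis example above can be checked with a short sketch. It assumes each hypothesis predicts deterministically, so P(vj|hi) is 1 for the label hi predicts and 0 otherwise; the function name is an assumption made for the example.

```python
# Posterior probabilities P(h | D) and the label each hypothesis assigns to x.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "positive", "h2": "negative", "h3": "negative"}


def bayes_optimal(posteriors, predictions, labels):
    """Weight each hypothesis' vote by its posterior and pick the best label."""
    scores = {
        v: sum(p for h, p in posteriors.items() if predictions[h] == v)
        for v in labels
    }
    return max(scores, key=scores.get), scores


label, scores = bayes_optimal(posteriors, predictions, ["positive", "negative"])

# For comparison: the label given by the single most probable hypothesis.
h_map = max(posteriors, key=posteriors.get)
map_label = predictions[h_map]
print(label, scores, map_label)
```

The MAP hypothesis h1 says "positive", while the posterior-weighted vote says "negative", which reproduces the disagreement described in the text.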
One highly practical Bayesian learning method is the naive Bayes
learner, often called the naive Bayes classifier.
In some domains its performance has been shown to be comparable
to that of neural network and decision tree learning.
This section introduces the naive Bayes classifier; the next section
applies it to the practical problem of learning to classify natural
language text documents.
The naive Bayes classifier applies to learning tasks where each
instance x is described by a conjunction of attribute values and where
the target function f(x) can take on any value from some finite set
V . A set of training examples of the target function is provided, and
a new instance is presented, described by the tuple of attribute values
<a1, a2, ..., an>. The learner is asked to predict the target value, or
classification, for this new instance.
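The Bayesian approach to classifying the new instance is to assign the
most probable target value, vMAP, given the attribute values
<a1, a2, ..., an> that describe the instance.
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$$
We can use Bayes theorem to rewrite this expression as
$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j) P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j) P(v_j) \quad (3)$$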
Now we could attempt to estimate the two terms in Equation (3)
based on the training data.
It is easy to estimate each of the P(vj) simply by counting the
frequency with which each target value vj occurs in the training data.
However, estimating the different P(a1, a2, ..., an|vj) terms in this
fashion is not feasible unless we have a very large set of training data.
The problem is that the number of these terms is equal to the number
of possible instances times the number of possible target values.
Therefore, we need to see every instance in the instance space many
times in order to obtain reliable estimates.
The naive Bayes classifier is based on the simplifying assumption that
the attribute values are conditionally independent given the target
value.
In other words, the assumption is that given the target value of the
instance, the probability of observing the conjunction a1, a2, ..., an is
just the product of the probabilities for the individual attributes:
$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$
Substituting this into Equation (3), we have the approach used by the
naive Bayes classifier:
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) \quad (4)$$
where vNB denotes the target value output by the naive Bayes
classifier.
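Equation (4) can be sketched with simple frequency counts. The training examples below are made up for illustration, and the function names are assumptions; probabilities are estimated as raw relative frequencies, with no smoothing.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in examples)
    attr_counts = defaultdict(Counter)  # (attribute position, label) -> value counts
    for attrs, label in examples:
        for i, value in enumerate(attrs):
            attr_counts[(i, label)][value] += 1
    return class_counts, attr_counts


def classify(model, attrs):
    """Return the label maximizing P(v) * prod_i P(a_i | v)."""
    class_counts, attr_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(attrs):
            score *= attr_counts[(i, label)][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (Sky, AirTemp, Wind) -> EnjoySport.
examples = [
    (("Sunny", "Warm", "Strong"), "yes"),
    (("Sunny", "Warm", "Weak"), "yes"),
    (("Sunny", "Cold", "Weak"), "yes"),
    (("Rainy", "Cold", "Strong"), "no"),
    (("Rainy", "Warm", "Strong"), "no"),
    (("Cloudy", "Cold", "Weak"), "no"),
]
model = train_naive_bayes(examples)
print(classify(model, ("Sunny", "Warm", "Strong")))
print(classify(model, ("Rainy", "Cold", "Strong")))
```

Because the estimates are raw frequencies, an attribute value never seen with a class drives that class' score to zero; practical implementations usually smooth the counts to avoid this.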
Abbreviation : BBN (Bayesian Belief Network)
Synonyms: Bayes (ian) network, Bayes(ian) model, Belief network,
Decision network, or probabilistic directed acyclic graphical model.
A BBN is a probabilistic graphical model that represents a set of
variables and their conditional dependencies via a Directed Acyclic
Graph (DAG).
BBNs enable us to model and reason about uncertainty. BBNs
accommodate both subjective probabilities and probabilities based on
objective data.
The most important use of BBNs is in revising probabilities in the
light of actual observations of events.
Nodes represent variables in the Bayesian sense: observable
quantities, hidden variables or hypotheses. Edges represent
conditional dependencies.
Each node is associated with a probability function that takes, as
input, a particular set of probabilities for values for the node’s parent
variables, and outputs the probability of the values of the variable
represented by the node.
Prior Probabilities: e.g. P(RAIN)
Conditional Probabilities: e.g. P(SPRINKLER | RAIN)
Joint Probability Function: P(GRASS WET, SPRINKLER, RAIN) =
P(GRASS WET | SPRINKLER, RAIN) P(SPRINKLER | RAIN) P(RAIN)
Typically the probability functions are described in table form.
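A table-driven sketch of the sprinkler network follows. The network structure (RAIN influences SPRINKLER, and both influence GRASS WET) and every number in the tables are assumptions chosen for illustration; the original probability tables are not reproduced in this text.

```python
from itertools import product

# P(RAIN)
P_RAIN = {True: 0.2, False: 0.8}
# P(SPRINKLER | RAIN), indexed as [rain][sprinkler]
P_SPRINKLER = {
    True: {True: 0.01, False: 0.99},
    False: {True: 0.4, False: 0.6},
}
# P(GRASS WET = True | SPRINKLER, RAIN), indexed by (sprinkler, rain)
P_WET = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(wet, sprinkler, rain):
    """Chain-rule factorization of the joint probability along the graph."""
    p_wet = P_WET[(sprinkler, rain)]
    p_wet = p_wet if wet else 1.0 - p_wet
    return p_wet * P_SPRINKLER[rain][sprinkler] * P_RAIN[rain]


# The joint distribution must sum to one over all eight assignments.
total = sum(joint(w, s, r) for w, s, r in product([True, False], repeat=3))

# Revise the probability of rain after observing wet grass.
p_wet = sum(joint(True, s, r) for s, r in product([True, False], repeat=2))
p_rain_given_wet = sum(joint(True, s, True) for s in [True, False]) / p_wet
print(total, p_rain_given_wet)
```

The last two lines show the "revising probabilities in the light of observations" use of a BBN: the prior P(RAIN) of 0.2 rises once wet grass is observed.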
EM Algorithm
The EM algorithm was explained and given its name in a classic
1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.
• EM is typically used to compute maximum likelihood estimates
given incomplete samples.
• The EM algorithm estimates the parameters of a model
– Starting from some initial guess, each iteration consists of
    * an E step (Expectation step)
    * an M step (Maximization step)
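General Idea
[Figure: the EM cycle (credit: 600.465 - Intro to NLP - J. Eisner). The E step produces a guess of the unknown hidden structure (tags, parses, weather) from the observed data (words, ice cream); the M step produces a new guess of the parameters from that structure.]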
MLE (Maximum likelihood estimation)
A general statement
Consider a sample (X1, ..., Xn) which is drawn from a probability distribution P(X|A), where A are
parameters. If the Xs are independent with probability density function P(Xi|A), the joint
probability of the whole set is
$$P(X_1, \ldots, X_n \mid A) = \prod_{i=1}^{n} P(X_i \mid A)$$
This may be maximised with respect to A to give the maximum likelihood estimates.
• Given
– A sample X={X1, …, Xn}
– A vector of parameters θ
• We define
– Likelihood of the data: P(X | θ)
– Log-likelihood of the data: L(θ)=log P(X|θ)
• Given X, find
$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$
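As a small worked case, the sketch below maximizes the log-likelihood of a coin's heads probability. The Bernoulli model, the sample, and the grid search are all assumptions made for illustration; the closed-form estimate (the fraction of heads) is computed alongside for comparison.

```python
import math


def log_likelihood(theta, sample):
    """L(theta) = log P(X | theta) for independent coin flips (1 = heads)."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in sample)


# Hypothetical sample: 7 heads in 10 flips.
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Search a grid of candidate parameters strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, sample))

# Closed form for this model: the observed fraction of heads.
theta_closed_form = sum(sample) / len(sample)
print(theta_ml, theta_closed_form)
```

The grid search is only there to make the argmax visible; for this model the maximum can be found analytically by setting the derivative of L(θ) to zero.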
Basic setting in EM
• X is a set of data points: observed data
• θ is a parameter vector.
• EM is a method to find θML where
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
• Calculating P(X | θ) directly is hard.
• Calculating P(X,Y|θ) is much simpler, where Y is “hidden” data (or
“missing” data).
The basic EM strategy
• Z = (X, Y)
– Z: complete data (“augmented data”)
– X: observed data (“incomplete” data)
– Y: hidden data (“missing” data)
The “missing” data Y
• Y need not necessarily be missing in the practical sense of the
word.
• It may just be a conceptually convenient technical device to
simplify the calculation of P(x |θ).
• There could be many possible Ys.
The EM algorithm
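• Start with initial estimate, θ^0
• Repeat until convergence
– E-step: calculate
$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)$$
– M-step: find
$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$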
The Q-function
The Q-function is the expected value of the complete data log-likelihood log P(X,Y|θ)
with respect to Y, given X and θ^t:
$$Q(\theta, \theta^t) = E\left[\log P(X, Y \mid \theta) \mid X, \theta^t\right] = \sum_{Y} P(Y \mid X, \theta^t) \log P(X, Y \mid \theta)$$
where,
– Y is a random vector.
– X=(x1, x2, …, xn) is a constant (vector).
– θ^t is the current parameter estimate and is a constant (vector).
– θ is the normal variable (vector) that we wish to adjust.
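The E and M steps can be sketched on a classic toy problem: two coins with unknown heads probabilities, where each trial uses one coin but the coin's identity is hidden. The model, the data, the starting guesses, and the equal chance of picking either coin are all assumptions made for this illustration.

```python
import math


def em_two_coins(heads, flips, theta_a, theta_b, iterations=20):
    """EM for two coins; which coin produced each trial is the hidden data Y."""
    history = []  # observed-data log-likelihood (up to a constant) per iteration
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        log_lik = 0.0
        for h in heads:
            t = flips - h
            like_a = theta_a ** h * (1.0 - theta_a) ** t
            like_b = theta_b ** h * (1.0 - theta_b) ** t
            # E step: responsibility P(y = A | x_i, theta_t), equal coin priors.
            resp_a = like_a / (like_a + like_b)
            resp_b = 1.0 - resp_a
            heads_a += resp_a * h
            tails_a += resp_a * t
            heads_b += resp_b * h
            tails_b += resp_b * t
            log_lik += math.log(0.5 * like_a + 0.5 * like_b)
        history.append(log_lik)
        # M step: maximize Q, which here is a weighted fraction of heads.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b, history


# Hypothetical data: number of heads in each of five trials of ten flips.
heads = [5, 9, 8, 4, 7]
theta_a, theta_b, history = em_two_coins(heads, 10, theta_a=0.6, theta_b=0.5)
print(theta_a, theta_b)
```

The recorded `history` should never decrease from one iteration to the next, which is the property that makes EM a safe hill-climbing procedure on the likelihood.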
Expectation Maximization EM
When to use
• data is only partially observable
    – unsupervised clustering: target value unobservable
    – supervised learning: some instance attributes unobservable
Some applications
• training Bayesian Belief Networks
• unsupervised clustering
• learning hidden Markov models
History of SVM
SVM is related to statistical learning theory.
SVM was first introduced in 1992.
SVM became popular because of its success in handwritten digit
recognition: a 1.1% test error rate for SVM, the same as the
error rate of a carefully constructed neural network.
SVM is now regarded as an important example of “kernel methods”,
one of the key areas in machine learning.
Binary Classification
Given training data (xi, yi) for i = 1 . . . N, with
xi ∈ Rd and yi ∈ {−1, 1}, learn a classifier f(x)
such that
$$f(x_i) \begin{cases} \ge 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases}$$
i.e. yif(xi) > 0 for a correct classification.
Linear separability
Linear classifiers
A linear classifier has the form
$$f(x) = w^\top x + b$$
• in 2D the discriminant is a line
• w is the normal to the line, and b the bias
• w is known as the weight vector
[Figure: the line f(x) = 0 separating the region f(x) > 0 from the region f(x) < 0.]
• in 3D the discriminant is a plane, and in nD it is a hyperplane
• For a K-NN classifier it was necessary to 'carry' the training data
• For a linear classifier, the training data is used to learn w and then discarded
• Only w is needed for classifying new data
The Perceptron Classifier
Given linearly separable data xi labelled into two categories yi ∈ {-1, 1},
find a weight vector w such that the discriminant function
$$f(x_i) = w^\top x_i + b$$
separates the categories for i = 1, ..., N
• how can we find this separating hyperplane?
The Perceptron Algorithm
Write the classifier as
$$f(x_i) = \tilde{w}^\top \tilde{x}_i + w_0 = w^\top x_i$$
where $w = (\tilde{w}, w_0)$ and $x_i = (\tilde{x}_i, 1)$
• Initialize w = 0
• Cycle through the data points { xi, yi }
• if xi is misclassified then w ← w + α yi xi
  (equivalently w ← w − α sign(f(xi)) xi, since sign(f(xi)) = −yi for a misclassified point)
• Until all the data is correctly classified
For example in 2D
[Figure: the weight vector and decision boundary before and after an update.]
NB after convergence $w = \sum_i \alpha_i x_i$
• if the data is linearly separable, then the algorithm will converge
• convergence can be slow …
• separating line close to training data
• we would prefer a larger margin for generalization
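The perceptron procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the data points are made up, the bias is folded into the weight vector by appending a constant 1 to each input, a point lying exactly on the boundary counts as misclassified, and the update moves w toward the correct side using the label yi.

```python
def train_perceptron(points, labels, alpha=1.0, max_epochs=1000):
    """Cycle through the data, correcting w on every misclassified point."""
    w = [0.0] * (len(points[0]) + 1)  # last component plays the role of the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            xa = list(x) + [1.0]  # augmented input (x, 1)
            f = sum(wi * xi for wi, xi in zip(w, xa))
            if y * f <= 0:  # misclassified, or exactly on the boundary
                w = [wi + alpha * y * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:
            return w
    raise RuntimeError("no convergence; the data may not be linearly separable")


# Hypothetical linearly separable data in 2D.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

w = train_perceptron(points, labels)
scores = [sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) for x in points]
print(w, scores)
```

The returned line separates the data but nothing forces it to sit far from the training points, which is the motivation for preferring a larger margin.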
What is the best w?
• maximum margin solution: most stable under perturbations of the inputs
Tennis example
[Figure: two classes of points, "play tennis" and "do not play tennis".]
Linear Support Vector Machine
Data: <xi, yi>, i = 1, ..., l
$x_i \in R^d$
$y_i \in \{-1, +1\}$
All hyperplanes in $R^d$ are parameterized by a vector w and a constant b.
They can be expressed as w•x+b=0 (remember the equation for a hyperplane
from algebra!)
Our aim is to find a hyperplane such that the classifier f(x)=sign(w•x+b)
correctly classifies our data.
Define the hyperplane H such that:
xi•w+b ≥ +1 when yi = +1
xi•w+b ≤ -1 when yi = -1
H1 and H2 are the planes:
H1: xi•w+b = +1
H2: xi•w+b = -1
The points on the planes H1 and H2 are the Support Vectors.
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d-.
Maximizing the margin
We want a classifier with as big a margin as possible.
Recall that the distance from a point (x0, y0) to a line Ax+By+c = 0 is
$$\frac{|A x_0 + B y_0 + c|}{\sqrt{A^2 + B^2}}$$
The distance between H and H1 is: |w•x+b| / ||w|| = 1/||w||
The distance between H1 and H2 is: 2/||w||
In order to maximize the margin, we need to minimize ||w||, with the
condition that there are no datapoints between H1 and H2:
xi•w+b ≥ +1 when yi = +1
xi•w+b ≤ -1 when yi = -1
These can be combined into yi(xi•w+b) ≥ 1
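The relationship between ||w|| and the margin can be checked numerically. The data points and the two candidate hyperplanes below are assumptions chosen for illustration; both candidates satisfy the constraints, but the one with the smaller ||w|| has the wider margin.

```python
import math


def margin_width(w):
    """Distance between the planes H1 and H2, which is 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_constraints(w, b, points, labels):
    """Check y_i (x_i . w + b) >= 1 for every training point."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in zip(points, labels)
    )


# Hypothetical linearly separable data in 2D.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate hyperplanes with the same direction but different ||w||.
w_small, b_small = (0.5, 0.5), -1.0
w_large, b_large = (1.0, 1.0), -2.0

ok_small = satisfies_constraints(w_small, b_small, points, labels)
ok_large = satisfies_constraints(w_large, b_large, points, labels)
print(ok_small, margin_width(w_small))
print(ok_large, margin_width(w_large))
```

For the first candidate the points (2, 2) and (0, 0) lie exactly on H1 and H2, so they would be the support vectors of this toy problem.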
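Constrained Optimization
Minimize $\|w\|$ subject to $y_i (x_i \cdot w + b) \ge 1$ for all i.
Introducing Lagrange multipliers $\alpha_i \ge 0$ (and minimizing the equivalent objective $\frac{1}{2}\|w\|^2$):
$$L_P = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i y_i (x_i \cdot w + b) + \sum_i \alpha_i$$
At the extremum, the partial derivative of $L_P$ with respect to w and b is zero, which gives
$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$$
Substituting back gives the dual problem: maximize
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$$
subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.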
Quadratic Programming
• Why is this reformulation a good thing?
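• The problem of maximizing
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$$
subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$
is an instance of what is called a positive semi-definite
programming problem
• For a fixed real-number accuracy, it can be solved in
O(n log n) time = O(|D|^2 log |D|^2)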
Problems with linear SVM
What if the decision function is not linear?
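Kernel Trick
Data points that are not linearly separable in the original space $(x_1, x_2)$
can be linearly separable in the space $(x_1^2, x_2^2, \sqrt{2} x_1 x_2)$.
We want to maximize
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, F(x_i) \cdot F(x_j)$$
Define $K(x_i, x_j) = F(x_i) \cdot F(x_j)$.
Cool thing: K is often easy to compute directly! Here, $K(x_i, x_j) = (x_i \cdot x_j)^2$.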
Other Kernels
The polynomial kernel
$K(x_i, x_j) = (x_i \cdot x_j + 1)^p$, where p is a tunable parameter.
Evaluating K only requires one addition and one exponentiation
more than the original dot product.
Gaussian kernels (also called radial basis functions)
$K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$
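The kernels discussed here can be written down directly. The sketch below is illustrative: the function names are assumptions, and the explicit feature map `phi` assumes the quadratic mapping (x1², x2², √2·x1·x2), which is a common choice for demonstrating that a kernel equals a dot product in a larger space.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Assumed explicit quadratic feature map for 2D inputs."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)


def quadratic_kernel(x, z):
    """Same value as dot(phi(x), phi(z)), without building the features."""
    return dot(x, z) ** 2


def polynomial_kernel(x, z, p=2):
    """(x . z + 1)^p with tunable degree p."""
    return (dot(x, z) + 1.0) ** p


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)), also known as an RBF kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


x, z = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(z)), quadratic_kernel(x, z))
print(polynomial_kernel(x, z, p=2), gaussian_kernel((0.0, 0.0), (1.0, 1.0)))
```

The first printed pair agrees up to rounding, which is the point of the kernel trick: the optimization only ever needs dot products, so the mapped features never have to be computed.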
Overtraining/overfitting
A well known problem with machine learning methods is overtraining.
This means that we have learned the training data very well, but
we cannot classify unseen examples correctly.
An example: a botanist who knows the trees he has studied really well. Every
time he sees a new tree, he claims it is not a tree.
Overtraining/overfitting 2
It can be shown that the portion, n, of unseen data that will be
misclassified is bounded by:
n ≤ number of support vectors / number of training examples
This is a measure of the risk of overtraining with SVM (there are also other
measures).
Ockham's razor principle: simpler systems are better than more complex ones.
In the SVM case: fewer support vectors mean a simpler representation of the
hyperplane.
Example: Understanding a certain cancer is easier if it can be described by one gene
than if we have to describe it with 5000.
A practical example: protein localization
• Proteins are synthesized in the cytosol.
• They are transported into different subcellular
locations where they carry out their
functions.
• Aim: To predict in what location a
certain protein will end up.
Conclusion
Regression is a statistical technique used to establish the relationship
between one dependent variable and one or more independent
variables.
Bayesian learning uses Bayes theorem to determine the conditional
probability of a hypothesis given some evidence or observations.
SVM is a useful alternative to neural networks.
Two key concepts of SVM: maximize the margin and the kernel trick.
Many SVM implementations are available on the web for you to try
on your data set!
[1] Thomas M. Mitchell.
Machine Learning.
McGraw-Hill, Inc., USA, 1 edition, 1997.
[2] Christopher M. Bishop.
Pattern Recognition and Machine Learning.
Springer, 2006.
[3] Stephen Marsland.
Machine Learning: An Algorithmic Perspective, Second Edition.
2nd edition, 2014.
[4] Ethem Alpaydin.
Introduction to Machine Learning.
Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 3 edition, 2014.
[5] B E Boser, I M Guyon, and V N Vapnik.
A training algorithm for optimal margin classifiers.
In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, volume 5, pages 144–152,
Springer-Verlag, Berlin Heidelberg, 1992.
[6] V. Vapnik.
The Nature of Statistical Learning Theory.
Springer, 1999.
Thank You!

Regression, Bayesian Learning and Support vector machine

  • 1. Machine Learning Techniques: Regression, Bayesian Learning, Support Vector Machine. Dr. Radhey Shyam, Professor, Dept. of Computer Science & Engg., BIET Lucknow. Following slides have been prepared by Radhey Shyam, with grateful acknowledgement of others who made their course contents freely available. Feel free to reuse these slides for your own academic purposes. Please send feedback at Dr. Radhey Shyam (BIET Lucknow-AKTU) Machine Learning Techniques September 29, 2020 1 / 64
  • 2. Outline: (1) Regression: Simple Linear Regression, Logistic Regression; (2) Bayesian Learning: Bayes Theorem, Concept Learning, Bayes Optimal Classifier, Naive Bayes Classifier, Bayesian Belief Networks, EM Algorithm; (3) Support Vector Machine; (4) Conclusion; (5) References.
  • 3. Regression: Linear Regression. In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. Linear regression is used to predict a continuous dependent variable from a given set of independent variables. It is used for solving regression problems, in which the values of continuous variables are predicted. Linear regression tries to find the best-fit line, through which the output can be easily predicted. The least squares estimation method (footnote 1) is used to fit the line so that its predictions are as accurate (footnote 2) as possible. The output of linear regression must be a continuous value, such as price or age.
  • 4. Regression (cont.): Linear regression requires the relationship between the dependent variable and the independent variables to be linear. In linear regression, there may be collinearity (footnote 3) between the independent variables. Some regression examples: Regression analysis is used in statistics to find trends in data. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that. Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you've been putting on weight over the last few years, it can predict how much you'll weigh in ten years' time if you continue to put on weight at the same rate.
  • 5. Regression (cont.): Regression with a single explanatory variable is called simple linear regression; it establishes the relationship between two variables using a straight line. If two or more explanatory variables have a linear relationship with the dependent variable, the regression is called multiple linear regression.
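As a minimal sketch of how a straight line can be fitted by least squares, the following uses the standard closed-form solution for one explanatory variable. The function name and the weight-over-years data are hypothetical, chosen only to echo the weight example in the slides.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: body weight (kg) recorded once a year for five years.
years = [0, 1, 2, 3, 4]
weights = [70.0, 71.0, 72.0, 73.0, 74.0]

slope, intercept = fit_simple_linear_regression(years, weights)
predicted_year_10 = slope * 10 + intercept  # extrapolate the fitted line
print(slope, intercept, predicted_year_10)
```

The prediction for year 10 is an extrapolation and is only as good as the assumption that the same linear trend continues.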
  • 6. Regression (cont.): Logistic Regression. It is used to solve classification problems in which a given element must be assigned to one of N categories. Typical examples: given an email, classify it as spam or not; given a vehicle, find the category it belongs to (car, truck, van, etc.). In other words, the output is a finite set of discrete values. Logistic regression is used to predict a categorical dependent variable from a given set of independent variables. It is used for solving classification problems, in which the values of categorical variables are predicted. In logistic regression, we find the S-curve by which we can classify the samples. The maximum likelihood estimation method is used to estimate the model parameters. The output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
  • 7. Regression (cont.): In logistic regression, a linear relationship between the dependent and independent variables is not required. In logistic regression, there should not be collinearity between the independent variables. Footnotes: (1) The least squares method is a statistical procedure for finding the best fit to a set of data points by minimizing the sum of the squared offsets (residuals) of the points from the fitted curve. Least squares regression is used to predict the behavior of dependent variables. (2) Accuracy is how close a measured value is to the actual value. Precision is how close the measured values are to each other. (3) Collinearity is a condition in which some of the independent variables are highly correlated.
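The S-curve used by logistic regression is the logistic (sigmoid) function. The sketch below shows how an already-fitted model with one feature could turn a score into a class label. The weight, bias, and threshold values are hypothetical; in practice the weight and bias would be estimated from data, for example by maximum likelihood.

```python
import math


def sigmoid(z):
    """Logistic (S-shaped) function mapping any real z into the interval (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(x, weight, bias):
    """Estimated probability that x belongs to class 1 under a one-feature model."""
    return sigmoid(weight * x + bias)


def classify(x, weight, bias, threshold=0.5):
    """Return the categorical output (0 or 1) by thresholding the probability."""
    return 1 if predict_probability(x, weight, bias) >= threshold else 0


# Hypothetical, already-fitted parameters (not taken from the slides).
WEIGHT, BIAS = 2.0, -4.0
print(classify(3.0, WEIGHT, BIAS), classify(1.0, WEIGHT, BIAS))
```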
  • 8. Bayesian Learning: Bayesian Decision Theory came long before Version Spaces, Decision Tree Learning and Neural Networks. It was studied in the field of Statistical Theory and more specifically, in the field of Pattern Recognition. Bayesian Decision Theory is at the basis of important learning schemes such as the Naïve Bayes Classifier, Learning Bayesian Belief Networks and the EM Algorithm. Bayesian Decision Theory is also useful as it provides a framework within which many non-Bayesian classifiers can be studied. Bayes Theorem. Goal: to determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
  • 9. Bayesian Learning (cont.): Prior probability of h, P(h): it reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data). Prior probability of D, P(D): it reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds. Conditional probability of observation D, P(D|h): it denotes the probability of observing data D given some world in which hypothesis h holds. Posterior probability of h, P(h|D): it represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D and it is the quantity that Machine Learning researchers are interested in.
  • 10. Bayesian Learning (cont.): Bayes theorem allows us to compute P(h|D): P(h|D) = P(D|h) P(h) / P(D). Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood. Goal: to find the most probable hypothesis h from a set of candidate hypotheses H given the observed data D. MAP hypothesis: h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h). The last step drops P(D) because it does not depend on h.
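A small numeric sketch of Bayes theorem and the MAP choice follows. The two hypotheses and their probabilities are made up for illustration, and the code assumes the hypotheses in H are mutually exclusive and exhaustive, so that P(D) can be obtained by summing P(D|h) P(h) over H.

```python
def posterior(priors, likelihoods):
    """Return P(h|D) for every h, using P(h|D) = P(D|h) P(h) / P(D)."""
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(unnormalized.values())  # P(D), by the law of total probability
    if evidence == 0:
        raise ValueError("the data has zero probability under every hypothesis")
    return {h: value / evidence for h, value in unnormalized.items()}


def map_hypothesis(priors, likelihoods):
    """Return argmax over h of P(D|h) P(h); P(D) is not needed for the argmax."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])


# Hypothetical numbers: P(h) and P(D|h) for two candidate hypotheses.
priors = {"h1": 0.7, "h2": 0.3}
likelihoods = {"h1": 0.2, "h2": 0.6}

post = posterior(priors, likelihoods)
best = map_hypothesis(priors, likelihoods)
print(post, best)
```

Here h1 has the larger prior, but the data is three times more likely under h2, so h2 ends up with the larger posterior.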
  • 11. Bayesian Learning (cont.): If every hypothesis in H is equally probable a priori, P(h) is the same constant for every h, so we only need to consider the likelihood of the data D given h, P(D|h). Then h_MAP becomes the maximum likelihood hypothesis, h_ML = argmax_{h∈H} P(D|h). Concept Learning. Inducing general functions from specific training examples is a main issue of machine learning. Concept learning: acquiring the definition of a general category from given positive and negative training examples of the category.
  • 12. Bayesian Learning (cont.): Concept learning can be seen as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples. The hypothesis space has a general-to-specific ordering of hypotheses, and the search can be efficiently organized by taking advantage of this naturally occurring structure over the hypothesis space. A formal definition of concept learning: inferring a boolean-valued function from training examples of its input and output. An example of concept learning is learning the concept of a bird from given examples of birds (positive examples) and non-birds (negative examples). We are trying to learn the definition of a concept from given examples.
  • 13. Bayesian Learning (cont.): A set of example days, each described by six attributes.
  • 14. Bayesian Learning (cont.): The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its attributes. EnjoySport: Hypothesis Representation. Each hypothesis consists of a conjunction of constraints on the instance attributes. Each hypothesis will be a vector of six constraints, specifying the values of the six attributes (Sky, AirTemp, Humidity, Wind, Water, and Forecast). Each constraint will be one of: "?", indicating that any value is acceptable for the attribute (don't care); a single value, specifying a single required value (e.g. Warm) (specific); "φ", indicating that no value is acceptable for the attribute (no value).
  • 15. Bayesian Learning (cont.)
    Hypothesis Representation
    A hypothesis over (Sky, AirTemp, Humidity, Wind, Water, Forecast): < Sunny, ?, ?, Strong, ?, Same >
    The most general hypothesis, that every day is a positive example: < ?, ?, ?, ?, ?, ? >
    The most specific hypothesis, that no day is a positive example: < φ, φ, φ, φ, φ, φ >
    The EnjoySport concept learning task requires learning the set of days for which EnjoySport = yes, describing this set by a conjunction of constraints over the instance attributes.
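The hypothesis representation above can be sketched in a few lines of Python. This is only an illustration: the names `ANY`, `EMPTY`, and `matches` are hypothetical, and `None` is used as a stand-in for the φ constraint.

```python
# Hypothetical encoding: "?" accepts any value, None stands in for the
# empty constraint φ (no value is acceptable).
ANY = "?"
EMPTY = None

def matches(hypothesis, instance):
    """Return True if the instance satisfies every constraint of the hypothesis."""
    for constraint, value in zip(hypothesis, instance):
        if constraint is EMPTY:
            return False          # φ rejects every instance
        if constraint != ANY and constraint != value:
            return False          # a specific value must match exactly
    return True

# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast
h = ("Sunny", ANY, ANY, "Strong", ANY, "Same")
most_general = (ANY,) * 6
most_specific = (EMPTY,) * 6

day = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
other_day = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")

print(matches(h, day), matches(h, other_day))
```

The most general hypothesis accepts every day and the most specific one rejects every day, which matches the two extremes described on the slide.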
  • 16. Bayesian Learning (cont.)
    EnjoySport Concept Learning Task
    Given:
      – Instances X: the set of all possible days, each described by the attributes
          Sky (values: Sunny, Cloudy, Rainy)
          AirTemp (values: Warm, Cold)
          Humidity (values: Normal, High)
          Wind (values: Strong, Weak)
          Water (values: Warm, Cold)
          Forecast (values: Same, Change)
      – Target concept (function) c: EnjoySport : X → {0, 1}
      – Hypotheses H: each hypothesis is described by a conjunction of constraints on the attributes.
      – Training examples D: positive and negative examples of the target function
    Determine:
      – A hypothesis h in H such that h(x) = c(x) for all x in D.
  • 17. Bayesian Learning (cont.)
    The Inductive Learning Hypothesis
    Although the learning task is to determine a hypothesis h identical to the target concept c over the entire set of instances X, the only information available about c is its value over the training examples.
    Inductive learning algorithms can at best guarantee that the output hypothesis fits the target concept over the training data.
    Lacking any further information, our assumption is that the best hypothesis regarding unseen instances is the hypothesis that best fits the observed training data. This is the fundamental assumption of inductive learning.
    The inductive learning hypothesis: any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
  • 18. Bayesian Learning (cont.)
    Concept Learning as Search
    Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation.
    The goal of this search is to find the hypothesis that best fits the training examples.
    By selecting a hypothesis representation, the designer of the learning algorithm implicitly defines the space of all hypotheses that the program can ever represent and therefore can ever learn.
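To get a feel for how large this search space is, the following sketch counts instances and hypotheses for EnjoySport, assuming the attribute value counts listed in the task definition (Sky has three values, the other five attributes have two). The variable names are illustrative.

```python
from math import prod

# Number of possible values per attribute, as listed in the EnjoySport task.
values_per_attribute = {
    "Sky": 3, "AirTemp": 2, "Humidity": 2,
    "Wind": 2, "Water": 2, "Forecast": 2,
}
counts = list(values_per_attribute.values())

# Distinct instances: one value per attribute.
n_instances = prod(counts)

# Syntactically distinct hypotheses: each attribute may also be "?" or φ.
n_syntactic = prod(k + 2 for k in counts)

# Semantically distinct hypotheses: any hypothesis containing φ classifies
# every instance as negative, so all of those collapse into a single one.
n_semantic = 1 + prod(k + 1 for k in counts)

print(n_instances, n_syntactic, n_semantic)
```

Even for this toy problem the conjunctive representation yields several hundred semantically distinct hypotheses, which is why the structure of the hypothesis space matters for organizing the search.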
  • 19. Bayesian Learning (cont.)
    Bayes Optimal Classifier
    So far we have considered the question "what is the most probable hypothesis given the training data?"
    In fact, the question that is often of most significance is the closely related question "what is the most probable classification of the new instance given the training data?"
    Although it may seem that this second question can be answered by simply applying the MAP hypothesis to the new instance, in fact it is possible to do better.
    To develop some intuition, consider a hypothesis space containing three hypotheses, h1, h2, and h3. Suppose that the posterior probabilities of these hypotheses given the training data are 0.4, 0.3, and 0.3 respectively.
  • 20. Bayesian Learning (cont.)
    Thus, h1 is the MAP hypothesis. Suppose a new instance x is encountered, which is classified positive by h1 but negative by h2 and h3.
    Taking all hypotheses into account, the probability that x is positive is 0.4 (the probability associated with h1), and the probability that it is negative is therefore 0.6.
    The most probable classification (negative) in this case is different from the classification generated by the MAP hypothesis.
    In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
    If the possible classification of the new example can take on any value v_j from some set V, then the probability P(v_j|D) that the correct classification for the new instance is v_j is just
    $P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$
  • 21. Bayesian Learning (cont.)
    The optimal classification of the new instance is the value v_j for which P(v_j|D) is maximum:
    $\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$    (1)
    Any system that classifies new instances according to Equation (1) is called a Bayes optimal classifier, or Bayes optimal learner.
    No other classification method using the same hypothesis space and the same prior knowledge can outperform this method on average.
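The three-hypothesis example can be checked numerically. The sketch below assumes deterministic hypotheses, so P(v|h) is 1 when h predicts v and 0 otherwise; the function name `bayes_optimal` and the dictionary layout are illustrative choices.

```python
# Posteriors P(h|D) and predictions for the new instance x, taken from the
# three-hypothesis example in the slides.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

def bayes_optimal(posteriors, predictions, labels):
    """Return the label maximizing sum_h P(v|h) P(h|D), plus all label scores.

    Each hypothesis is assumed deterministic: P(v|h) = 1 if h predicts v, else 0.
    """
    scores = {
        v: sum(p for h, p in posteriors.items() if predictions[h] == v)
        for v in labels
    }
    return max(scores, key=scores.get), scores

label, scores = bayes_optimal(posteriors, predictions, labels=("+", "-"))
map_hypothesis = max(posteriors, key=posteriors.get)

print("MAP hypothesis predicts:", predictions[map_hypothesis])
print("Bayes optimal classification:", label, scores)
```

The MAP hypothesis h1 votes positive, while the posterior-weighted vote is negative, which is exactly the disagreement the slides describe.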
  • 22. Bayesian Learning (cont.)
    NAIVE BAYES CLASSIFIER
    One highly practical Bayesian learning method is the naive Bayes learner, often called the naive Bayes classifier. In some domains its performance has been shown to be comparable to that of neural network and decision tree learning.
    The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V.
    A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values $\langle a_1, a_2, \ldots, a_n \rangle$.
    The learner is asked to predict the target value, or classification, for this new instance.
  • 23. Bayesian Learning (cont.)
    The Bayesian approach to classifying the new instance is to assign the most probable target value, v_MAP, given the attribute values $\langle a_1, a_2, \ldots, a_n \rangle$ that describe the instance:
    $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$
    We can use Bayes theorem to rewrite this expression as
    $v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}$    (2)
    $\phantom{v_{MAP}} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$    (3)
    Now we could attempt to estimate the two terms in Equation (3) based on the training data.
  • 24. Bayesian Learning (cont.)
    It is easy to estimate each of the P(v_j) simply by counting the frequency with which each target value v_j occurs in the training data.
    However, estimating the different $P(a_1, a_2, \ldots, a_n \mid v_j)$ terms in this fashion is not feasible unless we have a very large set of training data.
    The problem is that the number of these terms is equal to the number of possible instances times the number of possible target values. We would therefore need to see every instance in the instance space many times in order to obtain reliable estimates.
    The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value.
    In other words, the assumption is that given the target value of the instance, the probability of observing the conjunction $a_1, a_2, \ldots, a_n$ is just the product of the probabilities for the individual attributes:
    $P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$
    Substituting this into Equation (3) yields the naive Bayes classifier.
  • 25. Bayesian Learning (cont.)
    The approach used by the naive Bayes classifier is
    $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$    (4)
    where v_NB denotes the target value output by the naive Bayes classifier.
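A minimal sketch of Equation (4), estimating P(v_j) and P(a_i|v_j) by raw frequency counts. The six training examples are invented for illustration, the function names are hypothetical, and no smoothing is applied, so an attribute value never seen with a label gives that label a score of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, label). Returns raw frequency counts."""
    label_counts = Counter(label for _, label in examples)
    cond_counts = defaultdict(Counter)  # (attribute position, label) -> value counts
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            cond_counts[(i, label)][a] += 1
    return label_counts, cond_counts

def classify(attrs, label_counts, cond_counts):
    """Return the label maximizing P(v) * prod_i P(a_i | v), with its score."""
    total = sum(label_counts.values())
    best_label, best_score = None, -1.0
    for label, n in label_counts.items():
        score = n / total                                  # estimate of P(v)
        for i, a in enumerate(attrs):
            score *= cond_counts[(i, label)][a] / n        # estimate of P(a_i | v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Invented toy data: (Sky, Wind) -> EnjoySport
examples = [
    (("Sunny", "Strong"), "no"),
    (("Sunny", "Weak"), "yes"),
    (("Rainy", "Strong"), "no"),
    (("Rainy", "Weak"), "yes"),
    (("Sunny", "Weak"), "yes"),
    (("Rainy", "Strong"), "yes"),
]
label_counts, cond_counts = train_naive_bayes(examples)
label, score = classify(("Sunny", "Strong"), label_counts, cond_counts)
print(label, score)
```

In practice the frequency estimates are usually smoothed so that a single unseen attribute value does not zero out an entire product; that refinement is left out here to keep the correspondence with Equation (4) direct.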
  • 26. Bayesian Learning (cont.)
    BAYESIAN BELIEF NETWORKS
    Abbreviation: BBN (Bayesian Belief Network)
    Synonyms: Bayes(ian) network, Bayes(ian) model, belief network, decision network, or probabilistic directed acyclic graphical model.
    A BBN is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a Directed Acyclic Graph (DAG).
  • 27. Bayesian Learning (cont.)
    BBNs enable us to model and reason about uncertainty. BBNs accommodate both subjective probabilities and probabilities based on objective data.
    The most important use of BBNs is in revising probabilities in the light of actual observations of events.
    Nodes represent variables in the Bayesian sense: observable quantities, hidden variables or hypotheses. Edges represent conditional dependencies.
    Each node is associated with a probability function that takes as input a particular set of values for the node's parent variables and outputs the probability of the values of the variable represented by the node.
    Prior probabilities: e.g. P(RAIN)
    Conditional probabilities: e.g. P(SPRINKLER | RAIN)
  • 28. Bayesian Learning (cont.)
    Joint probability function:
    P(GRASS WET, SPRINKLER, RAIN) = P(GRASS WET | SPRINKLER, RAIN) · P(SPRINKLER | RAIN) · P(RAIN)
  • 29. Bayesian Learning (cont.) Typically the probability functions are described in table form.
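The factorization of the joint probability can be made concrete with small probability tables. The numbers below are illustrative assumptions only (the tables from the original slide did not survive extraction), and the helper name `joint` is hypothetical.

```python
from itertools import product

# Illustrative (assumed) probability tables; not taken from the slides.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}   # P(SPRINKLER=True | RAIN)
P_WET_GIVEN = {                                      # P(GRASS WET=True | SPRINKLER, RAIN)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint(wet, sprinkler, rain):
    """P(GRASS WET, SPRINKLER, RAIN) via the chain of conditional tables."""
    p_s = P_SPRINKLER_GIVEN_RAIN[rain]
    p_w = P_WET_GIVEN[(sprinkler, rain)]
    return ((p_w if wet else 1 - p_w)
            * (p_s if sprinkler else 1 - p_s)
            * P_RAIN[rain])

# The joint distribution sums to 1 over all eight assignments.
total = sum(joint(w, s, r) for w, s, r in product([True, False], repeat=3))

# Revising a probability in the light of an observation: P(RAIN | GRASS WET).
p_wet = sum(joint(True, s, r) for s, r in product([True, False], repeat=2))
p_rain_given_wet = sum(joint(True, s, True) for s in [True, False]) / p_wet

print(total, p_rain_given_wet)
```

With these assumed tables, observing wet grass raises the probability of rain from the prior of 0.2 to roughly 0.36, an instance of the probability revision that the slides name as the main use of BBNs.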
  • 30. Bayesian Learning (cont.)
    EM Algorithm
    The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.
    • EM is typically used to compute maximum likelihood estimates given incomplete samples.
    • The EM algorithm estimates the parameters of a model iteratively.
      – Starting from some initial guess, each iteration consists of an E step (Expectation step) and an M step (Maximization step).
  • 31. Bayesian Learning (cont.)
    General Idea
    [Figure, from J. Eisner, 600.465 Intro to NLP: starting from an initial guess of the unknown parameters (probabilities), the E step combines that guess with the observed structure (words, ice cream) to produce a guess of the unknown hidden structure (tags, parses, weather); the M step then uses the hidden-structure guess to re-estimate the parameters, and the cycle repeats.]
  • 32. Bayesian Learning (cont.)
    MLE (Maximum Likelihood Estimation): a general statement
    Consider a sample $(X_1, \ldots, X_n)$ drawn from a probability distribution P(X|A), where A are parameters. If the X_i are independent with probability density function $P(X_i \mid A)$, the joint probability of the whole set is
    $P(X_1, \ldots, X_n \mid A) = \prod_{i=1}^{n} P(X_i \mid A)$
    This may be maximised with respect to A to give the maximum likelihood estimates.
  • 33. Bayesian Learning (cont.)
    MLE (Maximum Likelihood Estimation)
    • Given
      – A sample $X = \{X_1, \ldots, X_n\}$
      – A vector of parameters θ
    • We define
      – Likelihood of the data: P(X | θ)
      – Log-likelihood of the data: L(θ) = log P(X | θ)
    • Given X, find $\theta_{ML} = \arg\max_{\theta \in \Omega} L(\theta)$
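As a small worked case of the definitions above, the sketch below maximizes the log-likelihood of an invented sample of coin flips under a Bernoulli model with parameter θ. A coarse grid search stands in for a proper optimizer; the names and the data are assumptions made for illustration.

```python
import math

def log_likelihood(theta, sample):
    """L(theta) = sum_i log P(x_i | theta) for independent Bernoulli draws."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in sample)

# Invented sample: 7 heads (1) and 3 tails (0).
sample = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Grid search over theta in (0, 1); the endpoints are excluded because log(0) is undefined.
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, sample))

# For a Bernoulli model the maximizer is known in closed form: the sample mean.
closed_form = sum(sample) / len(sample)

print(theta_ml, closed_form)
```

The grid maximizer agrees with the sample mean, which is the closed-form maximum likelihood estimate for this model.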
  • 34. Bayesian Learning (cont.)
    Basic setting in EM
    • X is a set of data points: observed data
    • θ is a parameter vector.
    • EM is a method to find θ_ML, where
      $\theta_{ML} = \arg\max_{\theta \in \Omega} L(\theta) = \arg\max_{\theta \in \Omega} \log P(X \mid \theta)$
    • Calculating P(X | θ) directly is hard.
    • Calculating P(X, Y | θ) is much simpler, where Y is "hidden" data (or "missing" data).
  • 35. Bayesian Learning (cont.)
    The basic EM strategy
    • Z = (X, Y)
      – Z: complete data ("augmented data")
      – X: observed data ("incomplete" data)
      – Y: hidden data ("missing" data)
  • 36. Bayesian Learning (cont.)
    The "missing" data Y
    • Y need not be missing in the practical sense of the word.
    • It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ).
    • There could be many possible Ys.
  • 37. Bayesian Learning (cont.)
    The EM algorithm
    • Start with an initial estimate, $\theta^0$
    • Repeat until convergence
      – E-step: calculate $Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)$
      – M-step: find $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$
  • 38. Bayesian Learning (cont.)
    The Q-function
    The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and $\theta^t$:
    $Q(\theta, \theta^t) = E_Y[\log P(X, Y \mid \theta) \mid X, \theta^t]$
    where
      – Y is a random vector.
      – $X = (x_1, x_2, \ldots, x_n)$ is a constant (vector).
      – $\theta^t$ is the current parameter estimate and is a constant (vector).
      – θ is the normal variable (vector) that we wish to adjust.
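The E-step and M-step can be illustrated with a two-coin mixture, a setup chosen here purely as an example. Each observation x_i is the number of heads in n tosses of one of two coins, the hidden variable y says which coin was used, and both coins are assumed equally likely to be picked. The data, the starting values, and the function names are all assumptions of this sketch.

```python
import math

def seq_loglik(heads, n, theta):
    """log P(x | y, theta) up to the binomial coefficient, which is constant in theta."""
    return heads * math.log(theta) + (n - heads) * math.log(1 - theta)

def em_two_coins(head_counts, n, theta_a, theta_b, iterations=20):
    """EM for a mixture of two coins with fixed, equal mixing weights."""
    history = []  # observed-data log-likelihood before each update
    for _ in range(iterations):
        # E-step: responsibility P(y = A | x_i, theta^t) for each observation.
        resp_a, loglik = [], 0.0
        for h in head_counts:
            pa = 0.5 * math.exp(seq_loglik(h, n, theta_a))
            pb = 0.5 * math.exp(seq_loglik(h, n, theta_b))
            resp_a.append(pa / (pa + pb))
            loglik += math.log(pa + pb)
        history.append(loglik)
        # M-step: maximizing Q gives responsibility-weighted head frequencies.
        weight_a = sum(resp_a)
        weight_b = len(head_counts) - weight_a
        theta_a = sum(r * h for r, h in zip(resp_a, head_counts)) / (n * weight_a)
        theta_b = sum((1 - r) * h for r, h in zip(resp_a, head_counts)) / (n * weight_b)
    return theta_a, theta_b, history

# Invented data: heads observed in five sets of 10 tosses.
theta_a, theta_b, history = em_two_coins([5, 9, 8, 4, 7], n=10, theta_a=0.6, theta_b=0.5)
print(theta_a, theta_b)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property that makes repeating the two steps until convergence a sensible strategy.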
  • 39. Bayesian Learning (cont.)
    Expectation Maximization (EM)
    When to use
      – data is only partially observable
      – unsupervised clustering: target value unobservable
      – supervised learning: some instance attributes unobservable
    Applications
      – training Bayesian belief networks
      – unsupervised clustering
      – learning hidden Markov models
  • 40. SUPPORT VECTOR MACHINE
    History of SVM
    SVM is related to statistical learning theory.
    SVM was first introduced in 1992.
    SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate, the same as the error rate of a carefully constructed neural network.
    SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.
  • 41. SUPPORT VECTOR MACHINE (cont.)
    Binary Classification
    Given training data $(x_i, y_i)$ for i = 1 . . . N, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier f(x) such that
    $f(x_i) \begin{cases} \ge 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases}$
    i.e. $y_i f(x_i) > 0$ for a correct classification.
  • 42. SUPPORT VECTOR MACHINE (cont.)
    Linear separability
    [Figure: two panels, "linearly separable" and "not linearly separable".]
  • 43. SUPPORT VECTOR MACHINE (cont.)
    Linear classifiers
    A linear classifier has the form $f(x) = w^\top x + b$
    • in 2D the discriminant is a line
    • w is the normal to the line, and b the bias
    • w is known as the weight vector
    [Figure: axes X1 and X2; the line f(x) = 0 separates the region f(x) > 0 from the region f(x) < 0.]
  • 44. SUPPORT VECTOR MACHINE (cont.)
    Linear classifiers
    A linear classifier has the form $f(x) = w^\top x + b$
    • in 3D the discriminant f(x) = 0 is a plane, and in nD it is a hyperplane
    For a K-NN classifier it was necessary to "carry" the training data.
    For a linear classifier, the training data is used to learn w and then discarded.
    Only w is needed for classifying new data.
  • 45. SUPPORT VECTOR MACHINE (cont.)
    The Perceptron Classifier
    Given linearly separable data $x_i$ labelled into two categories $y_i \in \{-1, 1\}$, find a weight vector w such that the discriminant function $f(x_i) = w^\top x_i + b$ separates the categories for i = 1, ..., N.
    • how can we find this separating hyperplane?
    The Perceptron Algorithm
    Write the classifier as $f(x_i) = \tilde{w}^\top \tilde{x}_i + w_0 = w^\top x_i$, where $w = (\tilde{w}, w_0)$ and $x_i = (\tilde{x}_i, 1)$.
    • Initialize w = 0
    • Cycle through the data points $\{x_i, y_i\}$
      – if $x_i$ is misclassified then $w \leftarrow w + \alpha\, y_i\, x_i$ (equivalently $w \leftarrow w - \alpha\, \mathrm{sign}(f(x_i))\, x_i$ when $f(x_i) \ne 0$)
    • Until all the data is correctly classified
  • 46. SUPPORT VECTOR MACHINE (cont.)
    For example in 2D
    [Figure: two panels with axes X1 and X2 showing the weight vector w before and after an update on a misclassified point x_i.]
    NB: after convergence, $w = \sum_{i}^{N} \alpha_i x_i$
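A minimal sketch of the perceptron loop on augmented inputs. It uses the update w ← w + α y_i x_i for a misclassified point, which moves w toward the correct side, and it treats points with f(x_i) = 0 as misclassified so that the all-zero initial w triggers a first update. The four data points, the `max_epochs` cap, and the function name are illustrative assumptions.

```python
def perceptron(points, labels, alpha=1.0, max_epochs=1000):
    """Learn w for f(x) = w . (x, 1) on linearly separable data."""
    data = [tuple(x) + (1.0,) for x in points]   # append 1 so the bias is part of w
    w = [0.0] * len(data[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(data, labels):
            f = sum(wi * xi for wi, xi in zip(w, x))
            if y * f <= 0:                       # misclassified, or on the boundary
                w = [wi + alpha * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                        # a full pass with no errors
            return w
    raise RuntimeError("no separating hyperplane found within max_epochs")

# Invented, linearly separable toy data.
points = [(2, 2), (3, 1), (-1, -2), (-2, -1)]
labels = [1, 1, -1, -1]
w = perceptron(points, labels)
print(w)
```

The cap on epochs is a practical safeguard added for this sketch: on data that is not linearly separable the loop would otherwise never terminate.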
  • 47. SUPPORT VECTOR MACHINE (cont.)
    • if the data is linearly separable, then the algorithm will converge
    • convergence can be slow …
    • the separating line can end up close to the training data
    • we would prefer a larger margin for generalization
    [Figure: perceptron example.]
  • 48. SUPPORT VECTOR MACHINE (cont.)
    What is the best w?
    • maximum margin solution: most stable under perturbations of the inputs
  • 49. SUPPORT VECTOR MACHINE (cont.)
    Tennis example
    [Figure: scatter plot with axes Humidity and Temperature; one marker denotes "play tennis", the other "do not play tennis".]
  • 50. SUPPORT VECTOR MACHINE (cont.)
    Linear Support Vector Machines
    Data: $\langle x_i, y_i \rangle$, i = 1, ..., l, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$
    [Figure: points labelled +1 and −1 in the (x1, x2) plane.]
  • 51. SUPPORT VECTOR MACHINE (cont.)
    Linear SVM 2
    Data: $\langle x_i, y_i \rangle$, i = 1, ..., l, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$
    All hyperplanes in $\mathbb{R}^d$ are parameterized by a vector w and a constant b, and can be expressed as w•x + b = 0 (remember the equation for a hyperplane from algebra!).
    Our aim is to find a hyperplane such that f(x) = sign(w•x + b) correctly classifies our data.
  • 52. SUPPORT VECTOR MACHINE (cont.)
    Definitions
    Define the hyperplane H such that:
      $x_i \cdot w + b \ge +1$ when $y_i = +1$
      $x_i \cdot w + b \le -1$ when $y_i = -1$
    H1 and H2 are the planes:
      H1: $x_i \cdot w + b = +1$
      H2: $x_i \cdot w + b = -1$
    The points on the planes H1 and H2 are the support vectors.
    d+ = the shortest distance to the closest positive point
    d- = the shortest distance to the closest negative point
    The margin of a separating hyperplane is d+ + d-.
  • 53. SUPPORT VECTOR MACHINE (cont.)
    Maximizing the margin
    We want a classifier with as big a margin as possible.
    Recall that the distance from a point $(x_0, y_0)$ to a line $Ax + By + c = 0$ is $|Ax_0 + By_0 + c| / \sqrt{A^2 + B^2}$.
    The distance between H and H1 is: $|w \cdot x + b| / \|w\| = 1 / \|w\|$
    The distance between H1 and H2 is: $2 / \|w\|$
    In order to maximize the margin, we need to minimize $\|w\|$, with the condition that there are no data points between H1 and H2:
      $x_i \cdot w + b \ge +1$ when $y_i = +1$
      $x_i \cdot w + b \le -1$ when $y_i = -1$
    These can be combined into $y_i (x_i \cdot w + b) \ge 1$.
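The relation between the norm of w and the margin can be checked on a tiny example. The three data points and the two candidate hyperplanes below are invented; both candidates satisfy the constraint y_i (x_i · w + b) ≥ 1, but the one with the smaller norm has the wider margin 2/||w||.

```python
import math

def margin(w):
    """Distance between the planes H1 and H2, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, b, points, labels):
    """Check y_i (x_i . w + b) >= 1 for every training point."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in zip(points, labels)
    )

# Invented toy data and two feasible (w, b) candidates.
points = [(2, 2), (3, 3), (1, 1)]
labels = [1, 1, -1]
candidates = [((1.0, 1.0), -3.0), ((2.0, 2.0), -6.0)]

for w, b in candidates:
    print(w, b, satisfies_constraints(w, b, points, labels), margin(w))
```

Scaling w and b by two keeps the same separating line but halves the distance between H1 and H2, which is why the optimization minimizes ||w|| under the constraints.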
  • 54. SUPPORT VECTOR MACHINE (cont.)
    Constrained Optimization Problem
    Minimize $\|w\|$ (equivalently $\frac{1}{2}\|w\|^2$) subject to $y_i (x_i \cdot w + b) \ge 1$ for all i.
    Lagrangian method: maximize $\inf_{w} L(w, b, \alpha)$, where
    $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right]$
    At the extremum, the partial derivatives of L with respect to both w and b must be 0. Taking the derivatives, setting them to 0, substituting back into L, and simplifying yields:
    Maximize $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
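The intermediate steps of this derivation can be written out as follows. This is a sketch that assumes the Lagrangian is taken with the squared norm, $\frac{1}{2}\|w\|^2$, which has the same minimizer as $\|w\|$.

```latex
\begin{align*}
L(w, b, \alpha) &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] \\
\frac{\partial L}{\partial w} &= w - \sum_i \alpha_i y_i x_i = 0
  \quad\Longrightarrow\quad w = \sum_i \alpha_i y_i x_i \\
\frac{\partial L}{\partial b} &= -\sum_i \alpha_i y_i = 0 \\
L &= \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
   - \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
   - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
\end{align*}
```

The term in b vanishes because of the condition from the derivative with respect to b, and the resulting objective depends on the data only through the dot products $x_i \cdot x_j$.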
  • 55. SUPPORT VECTOR MACHINE (cont.)
    Quadratic Programming
    Maximize $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
    • Why is this reformulation a good thing?
    • The problem is an instance of what is called a positive semi-definite programming problem.
    • For a fixed real-number accuracy, it can be solved in $O(n \log n)$ time $= O(|D|^2 \log |D|^2)$.
  • 56. SUPPORT VECTOR MACHINE (cont.)
    Problems with linear SVM
    What if the decision function is not linear?
    [Figure: points labelled −1 and +1 that cannot be separated by a straight line.]
  • 57. SUPPORT VECTOR MACHINE (cont.)
    Kernel Trick
    Data points are linearly separable in the space $(x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2)$.
    We want to maximize $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, F(x_i) \cdot F(x_j)$.
    Define $K(x_i, x_j) = F(x_i) \cdot F(x_j)$.
    Cool thing: K is often easy to compute directly! Here, $K(x_i, x_j) = (x_i \cdot x_j)^2$.
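The identity behind the kernel trick can be verified directly for the feature map F(x) = (x1², x2², √2·x1·x2): the dot product in the mapped space equals the squared dot product in the original space. The function names below are illustrative.

```python
import math

def feature_map(x):
    """F(x) = (x1^2, x2^2, sqrt(2) x1 x2) for a 2D point x."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(x, y):
    """K(x, y) = (x . y)^2, computed without ever forming F(x)."""
    return dot(x, y) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(feature_map(a), feature_map(b))   # dot product in the mapped space
implicit = kernel(a, b)                          # same value from the original space
print(explicit, implicit)
```

The kernel needs one dot product and one squaring in two dimensions, while the explicit route first builds two three-dimensional vectors; the gap widens quickly for higher-degree maps.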
  • 58. SUPPORT VECTOR MACHINE (cont.)
    Other Kernels
    The polynomial kernel: $K(x_i, x_j) = (x_i \cdot x_j + 1)^p$, where p is a tunable parameter. Evaluating K requires only one addition and one exponentiation more than the original dot product.
    Gaussian kernels (also called radial basis functions): $K(x_i, x_j) = \exp\!\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$
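Both kernels are short functions of the two input vectors. The sketch below writes the Gaussian kernel with a negative exponent, exp(−||x − y||² / (2σ²)), so that it decays from 1 toward 0 as the points move apart; the function names and default parameter values are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, p=2):
    """K(x, y) = (x . y + 1)^p, with p a tunable parameter."""
    return (dot(x, y) + 1) ** p

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), also called an RBF kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

print(polynomial_kernel((1, 2), (3, 4), p=2))
print(gaussian_kernel((0, 0), (3, 4), sigma=5.0))
```

A larger σ makes the Gaussian kernel decay more slowly with distance, so σ plays a role for this kernel similar to the one p plays for the polynomial kernel.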
  • 59. SUPPORT VECTOR MACHINE (cont.)
    Overtraining/overfitting
    A well-known problem with machine learning methods is overtraining. This means that we have learned the training data very well, but we cannot classify unseen examples correctly.
    An example: a botanist who really knows trees. Every time he sees a new tree, he claims it is not a tree.
  • 60. SUPPORT VECTOR MACHINE (cont.)
    Overtraining/overfitting 2
    It can be shown that the portion, n, of unseen data that will be misclassified is bounded by:
    n ≤ number of support vectors / number of training examples
    This is a measure of the risk of overtraining with SVM (there are also other measures).
    Ockham's razor principle: simpler systems are better than more complex ones. In the SVM case, fewer support vectors mean a simpler representation of the hyperplane.
    Example: understanding a certain cancer is easier if it can be described by one gene than if we have to describe it with 5000.
  • 61. SUPPORT VECTOR MACHINE (cont.)
    A practical example: protein localization
    • Proteins are synthesized in the cytosol.
    • They are transported into different subcellular locations where they carry out their functions.
    • Aim: to predict in what location a certain protein will end up.
  • 62. Conclusion
    Regression is a statistical technique used to establish the relationship between one dependent variable and one or more independent variables.
    Bayesian learning uses Bayes theorem to determine the conditional probability of a hypothesis given some evidence or observations.
    SVM is a useful alternative to neural networks.
    Two key concepts of SVM: maximizing the margin and the kernel trick.
    Many SVM implementations are available on the web for you to try on your data set!