MACHINE LEARNING
(20BT60501)
COURSE DESCRIPTION:
Concept learning, General to specific ordering, Decision tree
learning, Support vector machine, Artificial neural networks,
Multilayer neural networks, Bayesian learning, Instance based
learning, reinforcement learning.
Subject :MACHINE LEARNING -(20BT60501)
Topic: Unit IV – BAYESIAN LEARNING
Prepared By:
Dr.J.Avanija
Professor
Dept. of CSE
Sree Vidyanikethan Engineering College
Tirupati.
Unit IV – BAYESIAN LEARNING
Bayes Theorem and Concept Learning
Maximum likelihood and least-squared error hypothesis
Maximum likelihood hypotheses for predicting probabilities
Minimum Description Length principle
Bayes optimal classifier
Gibbs algorithm
Naive Bayes classifier
An Example – Learning to classify text
Bayesian belief networks
EM Algorithm
Introduction
4
Bayesian Learning provides a probabilistic approach to inference.
It is based on the assumption that the quantities of interest are
governed by probability distributions and that optimal decisions
can be made by reasoning about these probabilities together
with observed data.
Bayesian learning algorithms calculate explicit probabilities for
hypotheses.
Features of Bayesian Learning Methods
5
Each observed training example can incrementally decrease or
increase the estimated probability that a hypothesis is correct. This
provides a more flexible approach to learning than algorithms that
completely eliminate a hypothesis if it is found to be inconsistent
with any single example.
Prior knowledge can be combined with observed data to determine
the final probability of a hypothesis. In Bayesian learning, prior knowledge
is provided by asserting
(1) a prior probability for each candidate hypothesis, and
(2) a probability distribution over observed data for each
possible hypothesis.
Bayesian methods can accommodate hypotheses that make
probabilistic predictions (e.g., hypotheses such as "this pneumonia
patient has a 93% chance of complete recovery").
Features of Bayesian Learning Methods
6
New instances can be classified by combining the predictions of
multiple hypotheses, weighted by their probabilities.
They require initial knowledge of many probabilities. When these
probabilities are not known in advance they are often estimated based on
background knowledge, previously available data, and assumptions
about the form of the underlying distributions.
They require significant computational cost.
They can provide a standard of optimal decision making against
which other practical methods can be measured.
Bayes Theorem
7
Terminology
H is the hypothesis space; h is a hypothesis.
D is the training dataset; d is a training sample.
P(h) denotes the prior probability that hypothesis h holds, before training
data is observed.
It reflects any background knowledge we have about the chance that h is a
correct hypothesis. If we have no such prior knowledge, then we might
simply assign the same prior probability to each candidate hypothesis.
P(D) denotes the prior probability that training data D will be observed
(i.e., the probability of D given no knowledge about which hypothesis holds).
P(D | h) denotes the probability of observing data D given hypothesis h
holds.
P(h | D) denotes the posterior probability of h, i.e., the probability that h holds given the observed
training data D.
Bayes Theorem: P(h | D) = P(D | h) P(h) / P(D)
Maximum a Posteriori (MAP) Hypothesis
8
For a given problem there can be multiple hypotheses consistent with the data.
The most (maximally) probable hypothesis given the data D is called the maximum a posteriori (MAP)
hypothesis:
hMAP = argmax h∈H P(h | D) = argmax h∈H P(D | h) P(h)
P(D) is dropped from the argmax since it is the same constant for all hypotheses.
Assuming that P(h) is the same for all hypotheses, P(D | h) is often called the likelihood of the data D
given h, and any hypothesis that maximizes P(D | h) is called a maximum likelihood (ML) hypothesis:
hML = argmax h∈H P(D | h)
Bayes Rule - Medical Diagnosis Problem
9
Two alternative hypotheses:
(1) that the patient has a particular form of cancer. Hypothesis h = cancer
(2) that the patient does not have cancer. Hypothesis -h = No cancer
The dataset contains patient data belonging to two classes: positive class (+) and negative class (-).
Prior Knowledge: over the entire population of people, only 0.008 have this disease.
P(h) = 0.008 P( -h ) = 0.992
The test returns a correct positive result in only 98% of the cases in which the disease is actually
present.
P( + | h) = 0.98 P( - | h) = 0.02
It returns a correct negative result in only 97% of the cases in which the disease is not present.
P( + | -h) = 0.03 P( - | -h) = 0.97
Bayes Rule - Medical Diagnosis Problem…
10
Suppose we now observe a new patient for whom the lab test returns a positive result. Should we
diagnose the patient as having cancer or not? The maximum a posteriori hypothesis is found as follows:
P( + | h) = 0.98 P(h) = 0.008 P( + | h) P(h) = 0.98 x 0.008 = 0.0078
P( + | -h) = 0.03 P(-h) = 0.992 P( + | -h) P(-h) = 0.03 x 0.992 = 0.0298
hMAP = -h (no cancer), since 0.0298 > 0.0078.
The MAP diagnosis is that the patient does not have cancer; the positive lab result is more likely to be
a false positive.
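As a quick check, a minimal Python sketch of this calculation (values as above, with the prior P(cancer) = 0.008); the true posteriors are obtained by normalizing the two products so that they sum to one.

```python
# Unnormalized posteriors for the medical diagnosis example (values from the slide).
p_cancer, p_no_cancer = 0.008, 0.992             # priors P(h), P(-h)
p_pos_given_cancer, p_pos_given_no = 0.98, 0.03  # likelihoods P(+|h), P(+|-h)

u_cancer = p_pos_given_cancer * p_cancer         # 0.0078
u_no_cancer = p_pos_given_no * p_no_cancer       # 0.0298

# Normalize to obtain the true posteriors P(h|+) and P(-h|+).
total = u_cancer + u_no_cancer
print(u_cancer / total, u_no_cancer / total)     # ~0.21 and ~0.79 -> MAP is "no cancer"
```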
Bayes Theorem and Concept Learning
11
Bayes theorem can be used as the basis for a straightforward learning algorithm that calculates the
probability of each hypothesis and outputs the most probable one.
This brute-force Bayesian learning algorithm can then be compared with concept learning algorithms.
Summary of basic probability formulas:
Product rule: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
Sum rule: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Bayes theorem: P(h | D) = P(D | h) P(h) / P(D)
Theorem of total probability: if events A1, ..., An are mutually exclusive with Σi P(Ai) = 1, then
P(B) = Σi P(B | Ai) P(Ai)
Bayes Theorem and Concept Learning...
12
Brute-Force MAP Learning Algorithm
A concept learning algorithm that outputs the maximum a posteriori (MAP) hypothesis based on
Bayes theorem:
1. For each hypothesis h in H, calculate the posterior probability P(h | D) = P(D | h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability: hMAP = argmax h∈H P(h | D)
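Since the algorithm itself fits in a few lines, here is a small sketch of the brute-force procedure under the assumptions discussed on the following slides (noise-free data, uniform prior). The toy hypothesis space and training data are illustrative, not taken from the slides.

```python
# Brute-force MAP learning over a toy hypothesis space.
# Each hypothesis maps an instance to a boolean classification.
hypotheses = {
    "h_ge_2": lambda x: x >= 2,
    "h_ge_5": lambda x: x >= 5,
    "h_even": lambda x: x % 2 == 0,
}
data = [(4, True), (1, False), (6, True)]   # (instance, classification) pairs

prior = {name: 1.0 / len(hypotheses) for name in hypotheses}   # uniform P(h)

def likelihood(h, data):
    """P(D|h): 1 if h is consistent with every example, else 0 (noise-free data)."""
    return 1.0 if all(h(x) == d for x, d in data) else 0.0

# Step 1: posterior (up to the constant P(D)) for every hypothesis in H.
posterior = {name: likelihood(h, data) * prior[name] for name, h in hypotheses.items()}
# Step 2: output the hypothesis with maximal posterior probability.
h_map = max(posterior, key=posterior.get)
print(h_map, posterior)
```

Note that both hypotheses consistent with the data receive the same posterior probability, which is exactly the version-space behaviour derived on the following slides.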
Bayes Theorem and Concept Learning...
13
Brute-Force MAP Learning Algorithm
This algorithm may require significant computation because it applies Bayes theorem to each
hypothesis in H to calculate P(h | D).
This is impractical for large hypothesis spaces.
The algorithm is still of interest because it provides a standard against which we may judge the
performance of other concept learning algorithms.
The Brute-Force MAP algorithm must specify values for P(h) and P(D | h).
P(h) and P(D | h) can be chosen to be consistent with the following assumptions:
1. The training data D is noise free.
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable than any other.
Bayes Theorem and Concept Learning...
14
Brute-Force MAP Learning Algorithm
Given these assumptions, what values should we specify for P(h)?
Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign
the same prior probability to every hypothesis h in H.
Furthermore, because we assume the target concept is contained in H, we should require that these
prior probabilities sum to 1.
Together these constraints imply that we should choose P(h) = 1 / |H| for all h in H.
Bayes Theorem and Concept Learning...
15
Brute-Force MAP Learning Algorithm
What choice shall we make for P(D|h)?
P(D|h) is the probability of observing the target values D = <d1, ..., dm> for the fixed set of
instances <x1, x2, ..., xm>, given a world in which h holds (h is the correct description of the
target concept c).
Since we assume noise-free training data, the probability of observing classification di given h is
just 1 if di = h(xi) and 0 if di ≠ h(xi).
Therefore, P(D|h) = 1 if di = h(xi) for all di in D, and P(D|h) = 0 otherwise.
In other words, the probability of the data D given hypothesis h is 1 if D is consistent with h, and
0 otherwise.
Bayes Theorem and Concept Learning...
16
Brute-Force MAP Learning Algorithm
Let us consider the first step of the algorithm, which uses Bayes theorem to compute the posterior
probability P(h|D) of each hypothesis h given the observed training data D.
First, consider the case where h is inconsistent with the training data D.
Since P(D|h) is 0 when h is inconsistent with D, we have
P(h|D) = (0 · P(h)) / P(D) = 0
so the posterior probability of any hypothesis inconsistent with D is zero.
Bayes Theorem and Concept Learning...
17
Brute-Force MAP Learning Algorithm
Now consider the case where h is consistent with D.
Since P(D|h) is 1 when h is consistent with D, we have
P(h|D) = (1 · (1/|H|)) / P(D) = (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|
where VSH,D, the version space of H with respect to D, is the subset of hypotheses in H consistent
with D (P(D) = |VSH,D| / |H| follows from the theorem of total probability).
Bayes Theorem and Concept Learning...
18
Brute-Force MAP Learning Algorithm
To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h)
and P(D|h) is:
P(h|D) = 1 / |VSH,D| if h is consistent with D, and P(h|D) = 0 otherwise.
Bayes Theorem and Concept Learning...
19
MAP Hypothesis and Consistent Learners
Every hypothesis consistent with D is a Maximum A Posteriori (MAP)
hypothesis.
This statement translates into a statement about a general class of learners called consistent
learners.
A learning algorithm is termed a consistent learner provided it outputs a hypothesis that commits
zero errors over the training examples.
Every consistent learner outputs a MAP hypothesis if we assume a uniform prior probability
distribution over H and deterministic, noise-free training data.
Find-S searches the hypothesis space H from specific to general hypotheses, outputting a maximally
specific consistent hypothesis.
Its output hypothesis is therefore a MAP hypothesis.
Bayesian framework can be used to characterize the behavior of learning
algorithms (Find-S) even when the learning algorithm does not explicitly
manipulate probabilities.
Bayes Theorem and Concept Learning...
20
Evolution of Probabilities Associated with Hypotheses
In figure (a), all hypotheses have the same prior probability.
In figures (b) and (c), as training data accumulates, the posterior probability of inconsistent
hypotheses becomes zero, while the total probability (summing to 1) is shared equally among the
remaining consistent hypotheses.
Bayes Theorem and Concept Learning...
21
MAP Hypothesis and Consistent Learners
Many learning approaches, such as neural network learning, linear regression, and polynomial curve
fitting, try to learn a continuous-valued target function.
The preceding Bayesian analysis, which assumed a noise-free, discrete-valued target concept, does
not apply directly to continuous-valued target functions.
However, under certain assumptions any learning algorithm that minimizes the squared error between
the output hypothesis predictions and the training data will output a MAXIMUM LIKELIHOOD HYPOTHESIS.
The significance of this result is that it provides a Bayesian justification (under these
assumptions) for many neural network and other curve-fitting methods that attempt to minimize the
sum of squared errors over the training data.
Maximum Likelihood and Least Squared Error
Hypothesis
22
Consider the problem of learning a continuous-valued target function such as
neural network learning, linear regression, and polynomial curve fitting
A straightforward Bayesian analysis will show that under certain assumptions any
learning algorithm that minimizes the squared error between the output
hypothesis predictions and the training data will output a maximum likelihood
(ML) hypothesis.
Maximum Likelihood and Least Squared Error
Hypothesis…
23
Learner L considers an instance space X and a hypothesis space H consisting
of some class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R]
and training examples of the form <xi,di>
 The problem faced by L is to learn an unknown target function f : X → R
A set of m training examples is provided, where the target value of each
example is corrupted by random noise drawn according to a Normal
probability distribution with zero mean (di = f(xi) + ei)
Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
Here f(xi) is the noise-free value of the target function and ei is a random
variable representing the noise.
It is assumed that the values of the ei are drawn independently and that they are
distributed according to a Normal distribution with zero mean.
The task of the learner is to output a maximum likelihood hypothesis or a
MAP hypothesis assuming all hypotheses are equally probable a priori.
Maximum Likelihood and Least Squared Error
Hypothesis…
24
A normal distribution (also known as the Gaussian, Gauss, or Laplace–Gauss distribution) is a type
of continuous probability distribution for a real-valued random variable. The general form of its
probability density function is
p(x) = (1 / sqrt(2πσ²)) e^( -(x - µ)² / (2σ²) )
The parameter µ is the mean or expectation of the distribution (and also its median and mode),
while the parameter σ is its standard deviation.
The variance of the distribution is σ².
A random variable with a Gaussian distribution is said to be normally
distributed, and is called a normal deviate.
Maximum Likelihood and Least Squared Error
Hypothesis…
25
Data can be "distributed" (spread out) in different ways, for example left skewed or right skewed.
Maximum Likelihood and Least Squared Error
Hypothesis…
26
There are many cases where the data tends to be around a central value with
no bias left or right, and it gets close to a "Normal Distribution" like this:
The normal distribution, also known as the Gaussian distribution, is the most
important probability distribution in statistics for independent, random
variables.
Many things closely follow a Normal Distribution:
heights of people
size of things produced by machines
errors in measurements
blood pressure
marks on a test
Maximum Likelihood and Least Squared Error
Hypothesis…
27
To find the maximum likelihood hypothesis for a continuous-valued target function, we start from
the definition of the maximum likelihood hypothesis, using lower case p to denote the probability
density function:
hML = argmax h∈H p(D|h)
We assume a fixed set of training instances (x1, ..., xm) and therefore consider the data D to be
the corresponding sequence of target values D = (d1, ..., dm).
Because the examples are mutually independent given h, we can write p(D|h) as the product of the
individual p(di|h):
hML = argmax h∈H ∏i=1..m p(di|h)
Substituting the Normal density for each p(di|h) gives
hML = argmax h∈H ∏i=1..m (1 / sqrt(2πσ²)) e^( -(di - h(xi))² / (2σ²) )
Maximum Likelihood and Least Squared Error
Hypothesis…
28
We now apply the transformation that is common in maximum likelihood
calculations.
Rather than maximizing the above complicated expression, we choose to maximize its less complicated
logarithm. This is justified because ln p is a monotonic function of p, so maximizing ln p(D|h)
also maximizes p(D|h):
hML = argmax h∈H Σi=1..m [ ln(1 / sqrt(2πσ²)) - (di - h(xi))² / (2σ²) ]
Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di
must also obey a Normal distribution centred on the true target value f(xi). Because we are writing
the expression for P(D|h), we assume h is the correct description of f.
Hence, µ = f(xi) = h(xi)
Maximum Likelihood and Least Squared Error
Hypothesis…
29
The first term is a constant independent of h, so it can be discarded:
hML = argmax h∈H Σi=1..m [ - (di - h(xi))² / (2σ²) ]
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity,
and we can again discard the constants that are independent of h:
hML = argmin h∈H Σi=1..m (di - h(xi))²
Maximum Likelihood and Least Squared Error
Hypothesis…
30
The above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the
sum of squared errors between the observed training values di and the hypothesis predictions h(xi).
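A short sketch under the stated assumptions (target corrupted by zero-mean Gaussian noise, candidate hypotheses equally probable a priori): among a handful of candidate hypotheses of the form h(x) = w·x, the one minimizing the sum of squared errors is the maximum likelihood hypothesis. The target function, noise level, and candidate slopes are illustrative.

```python
import random

# Target function f(x) = 3x, observed with zero-mean Gaussian noise.
random.seed(0)
xs = [x / 10 for x in range(1, 21)]
ds = [3 * x + random.gauss(0.0, 0.5) for x in xs]

# Candidate hypotheses: h_w(x) = w * x for a few values of w.
candidates = [2.0, 2.5, 3.0, 3.5]

def sse(w):
    """Sum of squared errors between observations d_i and predictions h(x_i)."""
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

# h_ML = argmin over the candidates of the sum of squared errors.
w_ml = min(candidates, key=sse)
print(w_ml, {w: round(sse(w), 2) for w in candidates})
```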
Maximum Likelihood Hypothesis for Predicting
Probabilities
31
Consider the setting in which we wish to learn a nondeterministic (probabilistic) function
f : X → {0, 1}, which has two discrete output values.
We want a function approximator whose output is the probability that f(x) = 1; in other words, we
learn the target function f' : X → [0, 1] such that f'(x) = P(f(x) = 1).
How can we learn f' using a neural network?
A brute-force approach would be to first collect the observed frequencies of 1's and 0's for each
possible value of x and then train the neural network to output the target frequency for each x.
Maximum Likelihood Hypothesis for Predicting
Probabilities…
32
What criterion should we optimize in order to find a maximum likelihood
hypothesis for f' in this setting?
First obtain an expression for P(D|h)
Assume the training data D is of the form D = {(x1, d1) . . . (xm, dm)}, where
di is the observed 0 or 1 value for f (xi).
Treating both xi and di as random variables, and assuming that each training example is drawn
independently, we can write P(D|h) as
P(D|h) = ∏i=1..m P(xi, di | h)
Applying the product rule, and noting that P(xi | h) = P(xi) because xi is independent of h:
P(D|h) = ∏i=1..m P(di | h, xi) P(xi)
Maximum Likelihood Hypothesis for Predicting
Probabilities…
33
The probability P(di|h, xi) is h(xi) if di = 1, and (1 - h(xi)) if di = 0.
Re-expressing it in a more mathematically manipulable form:
P(di|h, xi) = h(xi)^di (1 - h(xi))^(1 - di)
Substituting this value of P(di|h, xi) into the expression above, we get
P(D|h) = ∏i=1..m h(xi)^di (1 - h(xi))^(1 - di) P(xi)
Writing the expression for the maximum likelihood hypothesis, and dropping the terms P(xi), which
are independent of h:
hML = argmax h∈H Σi=1..m [ di ln h(xi) + (1 - di) ln(1 - h(xi)) ]
This equation describes the quantity that must be maximized in order to obtain the maximum
likelihood hypothesis in our current problem setting.
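A small sketch of this criterion: each candidate hypothesis h(x) is an estimate of P(f(x) = 1), and the maximum likelihood hypothesis is the one maximizing Σi [di ln h(xi) + (1 - di) ln(1 - h(xi))]. The data and the two candidate hypotheses are illustrative.

```python
import math

# Training data: (x_i, d_i) with d_i in {0, 1}.
data = [(0.5, 1), (1.5, 1), (2.5, 0), (3.5, 0)]

# Candidate hypotheses h(x) = estimated P(f(x) = 1); purely illustrative.
hypotheses = {
    "h_constant": lambda x: 0.5,
    "h_decreasing": lambda x: 1.0 / (1.0 + math.exp(x - 2.0)),
}

def log_likelihood(h):
    """Sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))."""
    return sum(d * math.log(h(x)) + (1 - d) * math.log(1 - h(x)) for x, d in data)

h_ml = max(hypotheses, key=lambda name: log_likelihood(hypotheses[name]))
print(h_ml, {name: round(log_likelihood(h), 3) for name, h in hypotheses.items()})
```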
Minimum Description Length Principle
34
The Minimum Description Length principle gives a Bayesian perspective on Occam's razor (the
inductive bias summarized as "choose the shortest explanation for the observed data").
It is motivated by interpreting the definition of hMAP in the light of basic concepts from
information theory:
hMAP = argmax h∈H P(D|h) P(h)
which can be equivalently expressed in terms of maximizing the log2,
hMAP = argmax h∈H [ log2 P(D|h) + log2 P(h) ]
or, alternatively, minimizing the negative of this quantity:
hMAP = argmin h∈H [ - log2 P(D|h) - log2 P(h) ]    (Equation 1)
Minimum Description Length Principle…
35
This equation (1) can be interpreted as a statement that short hypotheses are
preferred, assuming a particular representation scheme for encoding
hypotheses and data
-log2P(h): the description length of h under the optimal encoding for the hypothesis space H,
LCH(h) = -log2P(h), where CH is the optimal code for hypothesis space H.
-log2P(D | h): the description length of the training data D given hypothesis h,
LCD|h(D|h) = -log2P(D|h), where CD|h is the optimal code for describing data D assuming that both
the sender and receiver know the hypothesis h.
Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum of the
description length of the hypothesis plus the description length of the data given the hypothesis:
hMAP = argmin h∈H [ LCH(h) + LCD|h(D|h) ]
where CH and CD|h are the optimal encodings for H and for D given h, respectively.
Minimum Description Length Principle…
36
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes
the sum of these two description lengths.
Minimum Description Length principle:
hMDL = argmin h∈H [ LC1(h) + LC2(D|h) ]
where codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis,
respectively.
The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses, CH, and we
choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.
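A minimal sketch of the MDL choice using optimal (Shannon) code lengths -log2 P(h) and -log2 P(D|h); the prior and likelihood values are illustrative.

```python
import math

# Illustrative prior and likelihood values for three hypotheses.
prior = {"h1": 0.25, "h2": 0.50, "h3": 0.25}          # P(h)
likelihood = {"h1": 0.20, "h2": 0.02, "h3": 0.10}     # P(D|h)

def description_length(h):
    """L_C1(h) + L_C2(D|h) = -log2 P(h) - log2 P(D|h) under optimal encodings."""
    return -math.log2(prior[h]) - math.log2(likelihood[h])

h_mdl = min(prior, key=description_length)
# With optimal codes, minimizing the total description length maximizes P(D|h) P(h),
# so h_MDL coincides with h_MAP here.
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
print(h_mdl, h_map)
```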
Optimal Bayes Classifier
37
Consider a hypothesis space consisting of three hypotheses h1, h2, h3, whose posterior
probabilities given the training data are 0.4, 0.3, and 0.3 respectively.
h1 is the MAP hypothesis. A new instance x is encountered, which is classified positive by h1 and
negative by h2 and h3.
Taking all hypotheses into account, the probability that x is positive is 0.4 and the probability
that it is negative is 0.6.
The most probable classification (negative) therefore differs from the classification generated by
the MAP hypothesis (positive).
The most probable classification of the new instance is obtained by combining the predictions of
all hypotheses, weighted by their posterior probabilities.
For any value vj from some set V, the probability P(vj|D) that the correct classification for the
new instance is vj is just
P(vj|D) = Σhi∈H P(vj|hi) P(hi|D)
The optimal classification of the new instance is the value vj for which P(vj|D) is maximum.
Optimal Bayes Classifier…
38
The optimal classification of the new instance is the value vj for which P(vj|D) is maximum.
Bayes optimal classification:
argmax vj∈V Σhi∈H P(vj|hi) P(hi|D)
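A sketch of Bayes optimal classification for the three-hypothesis example above (posteriors 0.4, 0.3, 0.3, with h1 predicting + and h2, h3 predicting -).

```python
# Posterior probabilities P(h_i|D) and each hypothesis's prediction for instance x.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # P(v|h_i) is 1 for the predicted value, 0 otherwise

def p_value(v):
    """P(v|D) = sum_i P(v|h_i) P(h_i|D)."""
    return sum(p for h, p in posterior.items() if prediction[h] == v)

values = {v: p_value(v) for v in ("+", "-")}
print(values, max(values, key=values.get))   # {'+': 0.4, '-': 0.6} -> '-'
```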
Gibbs Algorithm
39
Although the Bayes optimal classifier obtains the best performance that
can be achieved from the given training data, it can be quite costly to
apply. The expense is due to the fact that it computes the posterior
probability for every hypothesis in H and then combines the
predictions of each hypothesis to classify each new instance.
An alternative, less optimal method -Gibbs algorithm
Choose a hypothesis h from H at random, according to the
posterior probability distribution over H.
Use h to predict the classification of the next instance x.
Under certain conditions, classifying the next instance according to a hypothesis drawn at random
from the current version space (according to a uniform distribution) has an expected error at most
twice that of the Bayes optimal classifier.
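A sketch of the Gibbs algorithm for the same three hypotheses: draw a single hypothesis at random according to the posterior distribution and use it alone to classify the next instance.

```python
import random

posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def gibbs_classify():
    """Sample h ~ P(h|D) and return its prediction for the next instance."""
    h = random.choices(list(posterior), weights=posterior.values(), k=1)[0]
    return prediction[h]

print(gibbs_classify())
```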
Naïve Bayes Classifier
40
 The naive Bayes classifier applies to learning tasks where each instance x
is described by a conjunction of attribute values and where the target
function f (x) can take on any value from some finite set V.
 A set of training examples of the target function is provided, and a new
instance is presented, described by the tuple of attribute values (al, a2..
.am).
 The learner is asked to predict the target value, or classification,
for this new instance.
The Bayesian approach to classifying the new instance is to assign the most probable target value,
vMAP, given the attribute values (a1, a2, ..., am) that describe the instance:
vMAP = argmax vj∈V P(vj | a1, a2, ..., am)
Using Bayes theorem to rewrite this expression:
vMAP = argmax vj∈V P(a1, a2, ..., am | vj) P(vj) / P(a1, a2, ..., am)
     = argmax vj∈V P(a1, a2, ..., am | vj) P(vj)    (Equation 1)
Naïve Bayes Classifier…
41
 The naive Bayes classifier is based on the assumption that the attribute values are conditionally
independent given the target value. That is, given the target value of the instance, the
probability of observing the conjunction (a1, a2, ..., am) is just the product of the probabilities
for the individual attributes:
P(a1, a2, ..., am | vj) = ∏i P(ai | vj)
 Substituting this into Equation (1) gives the naive Bayes classifier:
vNB = argmax vj∈V P(vj) ∏i P(ai | vj)
 where vNB denotes the target value output by the naive Bayes classifier.
Naïve Bayes Classifier…
42
 An Illustrative Example
 Let us apply the naive Bayes classifier to a concept learning problem
i.e., classifying days according to whether someone will play tennis.
 The table below provides a set of 14 training examples of the target concept PlayTennis, where
each day is described by the attributes Outlook, Temperature, Humidity, and Wind.
Naïve Bayes Classifier…
43
 An Illustrative Example
 Use the naive Bayes classifier and the training data from this table to
classify the following novel instance:
< Outlook = sunny, Temperature = cool, Humidity = high, Wind =
strong >
 Our task is to predict the target value (yes or no) of the target
concept PlayTennis for this new instance
 The probabilities of the different target values can easily be
estimated based on their frequencies over the 14 training
examples
 P(PlayTennis = yes) = 9/14 = 0.64
Naïve Bayes Classifier…
44
 An Illustrative Example
 The probabilities of the different target values can easily be
estimated based on their frequencies over the 14 training
examples
 P(PlayTennis = yes) = 9/14 = 0.64
 P(PlayTennis = no) = 5/14 = 0.36
Naïve Bayes Classifier…
45
 An Illustrative Example
 Similarly, estimate the conditional probabilities. For example, those for Wind
= strong
 P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
 P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
 Calculating vNB according to Equation (1):
P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206
 Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance,
based on the probability estimates learned from the training data.
 By normalizing the above quantities to sum to one, the conditional probability that the target
value is no, given the observed attribute values, is 0.0206 / (0.0206 + 0.0053) = 0.795.
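A sketch that reproduces these numbers. It embeds the standard 14-example PlayTennis table the slides refer to (Mitchell's dataset, restated here), counts the relevant frequencies, and evaluates vNB for the new instance.

```python
from collections import Counter, defaultdict

# The 14 PlayTennis training examples (Outlook, Temperature, Humidity, Wind, PlayTennis).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),("Rain", "Mild", "High", "Strong", "No"),
]

class_counts = Counter(row[-1] for row in data)      # counts for P(v)
attr_counts = defaultdict(Counter)                   # counts for P(a_i | v)
for *attrs, label in data:
    for i, a in enumerate(attrs):
        attr_counts[label][(i, a)] += 1

def v_nb(instance):
    """Return the class maximizing P(v) * prod_i P(a_i | v), with the raw scores."""
    scores = {}
    for v, nv in class_counts.items():
        score = nv / len(data)
        for i, a in enumerate(instance):
            score *= attr_counts[v][(i, a)] / nv
        scores[v] = score
    return max(scores, key=scores.get), scores

print(v_nb(("Sunny", "Cool", "High", "Strong")))
# -> ('No', {'No': ~0.0206, 'Yes': ~0.0053}), matching the slide values above.
```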
Naïve Bayes Classifier…
46
 Estimating Probabilities
 We have estimated probabilities by the fraction of times the event is observed to
occur over the total number of opportunities.
 For example, in the above case we estimated P(Wind = strong | Play Tennis =
no) by the fraction nc /n where, n = 5 is the total number of training
examples for which PlayTennis = no, and nc = 3 is the number of these
for which Wind = strong.
 When nc = 0, nc/n is zero, and this probability estimate will dominate the quantity calculated in
Equation (2), because that quantity multiplies all the other probability terms by this zero value.
 To avoid this difficulty we can adopt a Bayesian approach to estimating the probability, using the
m-estimate defined as follows.
 m-estimate of probability: (nc + m·p) / (n + m)
Naïve Bayes Classifier…
47
 Estimating Probabilities
 m-estimate of probability: (nc + m·p) / (n + m)
 Here p is our prior estimate of the probability we wish to determine, and m is a constant called
the equivalent sample size, which determines how heavily to weight p relative to the observed data.
 A common method for choosing p in the absence of other information is to assume uniform priors;
that is, if an attribute has k possible values we set p = 1/k.
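A small sketch of the m-estimate with uniform priors p = 1/k; the numbers follow the Wind = strong, PlayTennis = no case above (k = 2 values for Wind, so p = 0.5), and the choice m = 2 is illustrative.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Wind has k = 2 values (weak, strong) -> uniform prior p = 1/2.
# Observed: n_c = 3 of n = 5 PlayTennis = no examples have Wind = strong.
print(m_estimate(3, 5, p=0.5, m=2))   # pulled slightly toward 0.5 relative to 3/5
print(m_estimate(0, 5, p=0.5, m=2))   # avoids a hard zero estimate when n_c = 0
```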
Bayesian Belief Networks
48
 The naive Bayes classifier makes significant use of the assumption that the values
of the attributes a1 . . .an are conditionally independent given the target value v.
 This assumption dramatically reduces the complexity of learning the target
function
 A Bayesian belief network describes the probability distribution governing a set of
variables by specifying a set of conditional independence assumptions along
with a set of conditional probabilities
 Bayesian belief networks allow stating conditional independence assumptions that
apply to subsets of the variables
Bayesian Belief Networks…
49
 Consider an arbitrary set of random variables Y1 . . . Yn , where each variable
Yi can take on the set of possible values V(Yi).
 The joint space of the set of variables Y is defined to be the cross product
V(Y1) × V(Y2) × ... × V(Yn).
 In other words, each item in the joint space corresponds to one of the possible assignments of
values to the tuple of variables (Y1 ... Yn). The probability distribution over this joint space is
called the joint probability distribution.
 The joint probability distribution specifies the probability for each of the possible variable
bindings for the tuple (Y1 ... Yn).
 A Bayesian belief network describes the joint probability distribution for a set
of variables
Bayesian Belief Networks…
50
 Conditional Independence
 Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y
given Z if the probability distribution governing X is independent of the value of Y given a value
for Z; that is, if
(∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
 where xi ∈ V(X), yj ∈ V(Y), and zk ∈ V(Z).
 The above expression is written in abbreviated form as
 P(X | Y, Z) = P(X | Z)
 Conditional independence can be extended to sets of variables. The set of variables X1 ... Xl is
conditionally independent of the set of variables Y1 ... Ym given the set of variables Z1 ... Zn if
P(X1 ... Xl | Y1 ... Ym, Z1 ... Zn) = P(X1 ... Xl | Z1 ... Zn)
Bayesian Belief Networks…
51
 Representation
A Bayesian belief network represents the joint probability distribution for a set of
variables. Bayesian networks (BN) are represented by directed acyclic graphs.
 The Bayesian network in the figure above represents the joint probability distribution over the
boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.
Bayesian Belief Networks…
52
 Representation
 A Bayesian network (BN) represents the joint probability distribution by specifying a
set of conditional independence assumptions
 BN represented by a directed acyclic graph, together with sets of local conditional
probabilities
 Each variable in the joint space is represented by a node in the Bayesian network
 The network arcs represent the assertion that each variable is conditionally independent of its
non-descendants in the network, given its immediate predecessors (parents) in the network.
 A conditional probability table (CPT) is given for each variable, describing the
probability distribution for that variable given the values of its immediate
predecessors
 The joint probability for any desired assignment of values (y1, ..., yn) to the tuple of network
variables (Y1 ... Yn) can be computed by the formula
P(y1, ..., yn) = ∏i=1..n P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
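A sketch of this product formula for a fragment of the example network (Storm → Campfire ← BusTourGroup). The only conditional probability taken from the slides is the 0.4 entry quoted on a later slide; every other number is a made-up placeholder.

```python
# Joint probability P(y1, ..., yn) = product_i P(yi | Parents(Yi)) for a small
# fragment of the example network: Storm -> Campfire <- BusTourGroup.
p_storm = {True: 0.3, False: 0.7}        # illustrative prior for Storm
p_bus_tour = {True: 0.5, False: 0.5}     # illustrative prior for BusTourGroup
# CPT for Campfire given (Storm, BusTourGroup); 0.4 is the entry quoted on the slides,
# the other rows are made-up placeholders for this sketch.
p_campfire = {
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.2, (False, False): 0.05,
}

def joint(storm, bus, campfire):
    """P(Storm=storm, BusTourGroup=bus, Campfire=campfire) from the factored form."""
    p_c_true = p_campfire[(storm, bus)]
    return p_storm[storm] * p_bus_tour[bus] * (p_c_true if campfire else 1 - p_c_true)

print(joint(True, True, True))    # 0.3 * 0.5 * 0.4 = 0.06
```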
Bayesian Belief Networks…
53
Introduction
A Bayesian network is a directed graph in which each node is annotated with quantitative
probability information. The full specification is:
1. Each node corresponds to a random variable, which may be discrete or continuous.
2. Directed links (arrows) connect pairs of nodes; if there is an arrow from node X to node Y, X is
a parent of Y. The graph has no directed cycles (it is a directed acyclic graph).
3. Each node Xi has associated conditional probability information that quantifies the effect of
its parents on the node.
Bayesian Belief Networks…
54
Introduction
The topology of the network—the set of nodes and links—specifies the conditional
independence relationships that hold in the domain, in a way that will be made
precise shortly
The intuitive meaning of an arrow from X to Y is typically that X has a direct influence on Y,
which suggests that causes should be parents of effects.
Once the topology of the Bayes net is laid out, we need only specify the local
probability information for each variable, in the form of a conditional distribution
given its parents.
The full joint distribution for all the variables is defined by the topology and the
local probability information.
Bayesian Belief Networks…
55
Introduction
Bayesian Network Example
(Slides 55–62 present a worked Bayesian network example as figures.)
 Example
 Consider the node Campfire. The network nodes and arcs represent the assertion that
Campfire is conditionally independent of its non-descendants Lightning and
Thunder, given its immediate parents Storm and BusTourGroup.
 This means that once we know the value of the variables Storm and BusTourGroup,
the variables Lightning and Thunder provide no additional information about
Campfire
 The conditional probability table associated with the variable Campfire asserts, for example, that
P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
Bayesian Belief Networks…
64
 Example
 Inference
 Use a Bayesian network to infer the value of some target variable (e.g.,
ForestFire) given the observed values of the other variables.
 Inference can be straightforward if values for all of the other variables in the
network are known exactly.
 A Bayesian network can be used to compute the probability distribution for
any subset of network variables given the values or distributions for any
subset of the remaining variables.
 Exact inference in an arbitrary Bayesian network is known to be NP-hard.
Bayesian Belief Networks…
65
 Learning Bayesian Belief Networks
 Effective algorithms can be considered for learning Bayesian belief networks from
training data by considering several different settings for learning problem
 First, the network structure might be given in advance, or it might have to be
inferred from the training data.
 Second, all the network variables might be directly observable in each training
example, or some might be unobservable.
 In the case where the network structure is given in advance and the variables are fully
observable in the training examples, learning the conditional probability tables is
straightforward: we simply estimate the CPT entries from the observed frequencies over the
training examples, just as we would for a naive Bayes classifier.
 In the case where the network structure is given but only some of the variable values are
observable in the training data, the learning problem is more difficult.
 This learning problem is analogous to learning the weights for the hidden units of an artificial
neural network; a gradient ascent procedure over the CPT entries can be used.
EM (Expectation-Maximization) Algorithm
66
 In real-world applications of machine learning it is very common that many relevant features are
available for learning but only a small subset of them is observable.
 The Expectation-Maximization (EM) algorithm can be used for latent variables (variables that are
not directly observable and are instead inferred from the values of other observed variables).
 This algorithm is the basis for many unsupervised clustering algorithms in the field of machine
learning.
 The essence of the Expectation-Maximization algorithm is to use the available observed data of
the dataset to estimate the missing data, and then use that estimate to update the values of the
parameters.
EM (Expectation-Maximization) Algorithm
67
EM Algorithm:
 Step 1: Initially, a set of values for the parameters is considered. A set of incomplete observed
data is given to the system, with the assumption that the observed data comes from a specific
model.
 Step 2: The next step is the Expectation step, or E-step. The observed data is used to estimate
(or "guess") the values of the incomplete or missing data, i.e., to update the variables of
interest.
 Step 3: The third step is the Maximization step, or M-step. The complete data generated in the
E-step is used to update the values of the parameters, i.e., to update the hypothesis.
 Step 4: The fourth step is to check for convergence of the values; if they have converged, stop,
else repeat steps 2 and 3 until convergence.
EM (Expectation-Maximization) Algorithm
68
EM Algorithm:
EM (Expectation-Maximization) Algorithm
69
Estimating the means of k Gaussians:
 The EM algorithm can be used for variables whose values are never directly observed, provided the
general form of the probability distribution governing these variables is known.
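A compact sketch of EM for estimating the means of k = 2 Gaussians with known, equal variance, which is the classic setting referred to here. The generated data, the variance, and the initialization are illustrative.

```python
import math, random

random.seed(1)
sigma = 1.0
# Sample data from two hidden Gaussians with true means 0 and 5.
data = [random.gauss(0, sigma) for _ in range(100)] + [random.gauss(5, sigma) for _ in range(100)]

def density(x, mu):
    # Unnormalized Gaussian density; the constant cancels in the responsibilities.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

mu = [random.choice(data), random.choice(data)]     # initial guesses for the two means
for _ in range(50):
    # E-step: expected membership (responsibility) of each point in each Gaussian.
    resp = []
    for x in data:
        w = [density(x, m) for m in mu]
        total = sum(w)
        resp.append([wi / total for wi in w])
    # M-step: re-estimate each mean as the responsibility-weighted average of the data.
    mu = [sum(r[j] * x for r, x in zip(resp, data)) / sum(r[j] for r in resp)
          for j in range(2)]

print([round(m, 2) for m in mu])   # should approach the true means 0 and 5
```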
EM (Expectation-Maximization) Algorithm
70
Uses of the EM Algorithm:
• It can be used to fill in missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used to estimate the parameters of a Hidden Markov Model (HMM).
• It can be used to discover the values of latent variables.
EM (Expectation-Maximization) Algorithm
71
Advantages of EM Algorithm:
 It is always guaranteed that likelihood will increase with each iteration.
 The E-step and M-step are often straightforward to implement for many problems.
 Solutions to the M-step often exist in closed form.
Disadvantages of EM Algorithm:
 It has slow convergence.
 It converges only to a local optimum.
 It requires both the forward and backward probabilities (numerical optimization requires only the
forward probability).
K-Means Algorithm
72
 The K-Means algorithm is an unsupervised learning algorithm.
 The algorithm makes simple assumptions regarding the distribution of the data.
 The K-Means algorithm attempts to detect clusters within the dataset under the optimization
criterion that the sum of the within-cluster (intra-cluster) variances is minimized.
 Hence the K-Means clustering algorithm produces a Minimum Variance Estimate (MVE) of the state of
the identified clusters in the data.
 While the method works well in detecting homogeneous clusters, it inevitably falls short because
of the simplistic assumption about the spherical shape of the clusters inherent in its optimization
function (i.e., it assumes that all the clusters have equal covariance matrices).
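A minimal K-Means sketch (k = 2, one-dimensional data) showing the assign/update loop that drives down the within-cluster variance; the data and initialization are illustrative.

```python
import random

random.seed(2)
data = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(6, 1) for _ in range(50)]
centers = random.sample(data, 2)                     # initial cluster centers

for _ in range(20):
    # Assignment step: each point goes to its nearest center.
    clusters = [[], []]
    for x in data:
        j = min(range(2), key=lambda c: (x - centers[c]) ** 2)
        clusters[j].append(x)
    # Update step: each center becomes the mean of its assigned points.
    centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]

print([round(c, 2) for c in centers])                # close to the true means 0 and 6
```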
K-Means Algorithm
73
 Unlike K-means, in EM, the clusters are not limited to spherical shapes. In
EM we can constrain the algorithm to provide different covariance matrices
(spherical, diagonal and generic).
 These different covariance matrices in return allow us to control the shape of
our clusters and hence we can detect sub-populations in our data with
different characteristics.
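For completeness, a hedged sketch of this idea using scikit-learn's GaussianMixture (assuming scikit-learn and NumPy are available): the covariance_type parameter selects spherical, diagonal, or generic ('full') covariance matrices, and richer covariances fit elongated clusters better.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated (non-spherical) clusters in 2-D.
X = np.vstack([
    rng.normal([0, 0], [3.0, 0.5], size=(200, 2)),
    rng.normal([8, 4], [0.5, 3.0], size=(200, 2)),
])

for cov in ("spherical", "diag", "full"):
    gm = GaussianMixture(n_components=2, covariance_type=cov, random_state=0).fit(X)
    print(cov, round(gm.score(X), 3))   # average log-likelihood per sample
```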

More Related Content

What's hot

Concept learning and candidate elimination algorithm
Concept learning and candidate elimination algorithmConcept learning and candidate elimination algorithm
Concept learning and candidate elimination algorithmswapnac12
 
Instance Based Learning in Machine Learning
Instance Based Learning in Machine LearningInstance Based Learning in Machine Learning
Instance Based Learning in Machine LearningPavithra Thippanaik
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERKnoldus Inc.
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision treesKnoldus Inc.
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
Machine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree LearningMachine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree Learningbutest
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classificationsathish sak
 
Instance based learning
Instance based learningInstance based learning
Instance based learningswapnac12
 
Dempster Shafer Theory AI CSE 8th Sem
Dempster Shafer Theory AI CSE 8th SemDempster Shafer Theory AI CSE 8th Sem
Dempster Shafer Theory AI CSE 8th SemDigiGurukul
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
Learning rule of first order rules
Learning rule of first order rulesLearning rule of first order rules
Learning rule of first order rulesswapnac12
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Naïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxNaïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxShubham Jaybhaye
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine LearningVARUN KUMAR
 

What's hot (20)

Concept learning and candidate elimination algorithm
Concept learning and candidate elimination algorithmConcept learning and candidate elimination algorithm
Concept learning and candidate elimination algorithm
 
Instance Based Learning in Machine Learning
Instance Based Learning in Machine LearningInstance Based Learning in Machine Learning
Instance Based Learning in Machine Learning
 
PAC Learning
PAC LearningPAC Learning
PAC Learning
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
Machine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree LearningMachine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree Learning
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classification
 
Instance based learning
Instance based learningInstance based learning
Instance based learning
 
Dempster Shafer Theory AI CSE 8th Sem
Dempster Shafer Theory AI CSE 8th SemDempster Shafer Theory AI CSE 8th Sem
Dempster Shafer Theory AI CSE 8th Sem
 
Graph coloring problem
Graph coloring problemGraph coloring problem
Graph coloring problem
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Learning rule of first order rules
Learning rule of first order rulesLearning rule of first order rules
Learning rule of first order rules
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Learning With Complete Data
Learning With Complete DataLearning With Complete Data
Learning With Complete Data
 
Naïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxNaïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptx
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
 

Similar to Machine Learning Course Overview

Module 4_F.pptx
Module  4_F.pptxModule  4_F.pptx
Module 4_F.pptxSupriyaN21
 
-BayesianLearning in machine Learning 12
-BayesianLearning in machine Learning 12-BayesianLearning in machine Learning 12
-BayesianLearning in machine Learning 12Kumari Naveen
 
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering CollegeBayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering CollegeDhivyaa C.R
 
Bayesian Learning- part of machine learning
Bayesian Learning-  part of machine learningBayesian Learning-  part of machine learning
Bayesian Learning- part of machine learningkensaleste
 
bayesNaive.ppt
bayesNaive.pptbayesNaive.ppt
bayesNaive.pptOmDalvi4
 
bayesNaive algorithm in machine learning
bayesNaive algorithm in machine learningbayesNaive algorithm in machine learning
bayesNaive algorithm in machine learningKumari Naveen
 
original
originaloriginal
originalbutest
 
Regression, Bayesian Learning and Support vector machine
Regression, Bayesian Learning and Support vector machineRegression, Bayesian Learning and Support vector machine
Regression, Bayesian Learning and Support vector machineDr. Radhey Shyam
 
Combining inductive and analytical learning
Combining inductive and analytical learningCombining inductive and analytical learning
Combining inductive and analytical learningswapnac12
 
Statistical Machine________ Learning.ppt
Statistical Machine________ Learning.pptStatistical Machine________ Learning.ppt
Statistical Machine________ Learning.pptSandeepGupta229023
 
Bayesian statistics for biologists and ecologists
Bayesian statistics for biologists and ecologistsBayesian statistics for biologists and ecologists
Bayesian statistics for biologists and ecologistsMasahiro Ryo. Ph.D.
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Course on Bayesian computational methods
Course on Bayesian computational methodsCourse on Bayesian computational methods
Course on Bayesian computational methodsChristian Robert
 
San Antonio short course, March 2010
San Antonio short course, March 2010San Antonio short course, March 2010
San Antonio short course, March 2010Christian Robert
 

Similar to Machine Learning Course Overview (20)

Bayes Theorem.pdf
Bayes Theorem.pdfBayes Theorem.pdf
Bayes Theorem.pdf
 
Module 4_F.pptx
Module  4_F.pptxModule  4_F.pptx
Module 4_F.pptx
 
-BayesianLearning in machine Learning 12
-BayesianLearning in machine Learning 12-BayesianLearning in machine Learning 12
-BayesianLearning in machine Learning 12
 
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering CollegeBayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
 
UNIT II (7).pptx
UNIT II (7).pptxUNIT II (7).pptx
UNIT II (7).pptx
 
Bayesian Learning- part of machine learning
Bayesian Learning-  part of machine learningBayesian Learning-  part of machine learning
Bayesian Learning- part of machine learning
 
bayesNaive.ppt
bayesNaive.pptbayesNaive.ppt
bayesNaive.ppt
 
bayesNaive.ppt
bayesNaive.pptbayesNaive.ppt
bayesNaive.ppt
 
bayesNaive algorithm in machine learning
bayesNaive algorithm in machine learningbayesNaive algorithm in machine learning
bayesNaive algorithm in machine learning
 
original
originaloriginal
original
 
Lecture5 xing
Lecture5 xingLecture5 xing
Lecture5 xing
 
Regression, Bayesian Learning and Support vector machine
Regression, Bayesian Learning and Support vector machineRegression, Bayesian Learning and Support vector machine
Regression, Bayesian Learning and Support vector machine
 
Combining inductive and analytical learning
Combining inductive and analytical learningCombining inductive and analytical learning
Combining inductive and analytical learning
 
Statistical Machine________ Learning.ppt
Statistical Machine________ Learning.pptStatistical Machine________ Learning.ppt
Statistical Machine________ Learning.ppt
 
Bayesian statistics for biologists and ecologists
Bayesian statistics for biologists and ecologistsBayesian statistics for biologists and ecologists
Bayesian statistics for biologists and ecologists
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
ppt
pptppt
ppt
 
Course on Bayesian computational methods
Course on Bayesian computational methodsCourse on Bayesian computational methods
Course on Bayesian computational methods
 
San Antonio short course, March 2010
San Antonio short course, March 2010San Antonio short course, March 2010
San Antonio short course, March 2010
 

Recently uploaded

Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 

Recently uploaded (20)

Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 

Machine Learning Course Overview

  • 8. Maximum a Posteriori (MAP) Hypothesis 8 For a given problem there can be multiple hypotheses that are consistent with the data. The most (maximally) probable hypothesis given the data D is called the MAP hypothesis, hMAP. P(D | h) is often called the likelihood of the data D given h, and any hypothesis that maximizes P(D | h) is called a maximum likelihood (ML) hypothesis; when P(h) is the same for every hypothesis, the ML hypothesis coincides with the MAP hypothesis. P(D) can be dropped from the maximization because it is a constant that is the same for all hypotheses.
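The equations referenced on these slides appear as images in this transcript; restated here in their standard form (a reconstruction for reference, not the original slide graphics):

```latex
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}
\qquad
h_{MAP} = \arg\max_{h \in H} P(D \mid h)\,P(h)
\qquad
h_{ML} = \arg\max_{h \in H} P(D \mid h)
```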
  • 9. Bayes Rule - Medical Diagnosis Problem 9 Two alternative hypotheses: (1) the patient has a particular form of cancer, hypothesis h = cancer; (2) the patient does not have cancer, hypothesis ¬h = no cancer. The observed data is a laboratory test with two possible outcomes: positive (+) and negative (-). Prior knowledge: over the entire population of people only 0.008 have this disease, so P(h) = 0.008 and P(¬h) = 0.992. The test returns a correct positive result in only 98% of the cases in which the disease is actually present: P(+ | h) = 0.98, P(- | h) = 0.02. It returns a correct negative result in only 97% of the cases in which the disease is not present: P(- | ¬h) = 0.97, P(+ | ¬h) = 0.03.
  • 10. Bayes Rule - Medical Diagnosis Problem… 10 Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not? The maximum a posteriori hypothesis is found by comparing the Bayes numerators: P(+ | h) P(h) = 0.98 x 0.008 = 0.0078, and P(+ | ¬h) P(¬h) = 0.03 x 0.992 = 0.0298. Since 0.0298 > 0.0078, hMAP = ¬h: the patient most probably does not have cancer, and the positive lab test result is more likely to be a false positive.
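As a quick check of the arithmetic above, a minimal Python sketch that computes and normalizes the two posteriors (the variable names are illustrative, not part of the original slides):

```python
# Posterior computation for the cancer diagnosis example.
p_cancer = 0.008             # P(h): prior probability of cancer
p_no_cancer = 0.992          # P(not h)
p_pos_given_cancer = 0.98    # P(+ | h)
p_pos_given_no_cancer = 0.03 # P(+ | not h)

# Unnormalized posteriors (Bayes numerators) for a positive test result
num_cancer = p_pos_given_cancer * p_cancer          # 0.0078
num_no_cancer = p_pos_given_no_cancer * p_no_cancer # ~0.0298

# Normalize so the two posteriors sum to 1
evidence = num_cancer + num_no_cancer               # P(+)
print(num_cancer / evidence)     # ~0.21 -> P(cancer | +)
print(num_no_cancer / evidence)  # ~0.79 -> P(no cancer | +)
```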
  • 11. Bayes Theorem and Concept Learning 11 Bayes theorem can be used as the basis for a straightforward learning algorithm that calculates the posterior probability of each hypothesis and outputs the most probable one. This brute-force Bayesian learning algorithm can then be compared with concept learning algorithms. Summary of basic probability formulas:
  • 12. Bayes Theorem and Concept Learning... 12 Brute-Force MAP Learning Algorithm A concept learning algorithm that outputs the Maximum a Posteriori (MAP) hypothesis based on Bayes theorem: (1) for each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h)P(h)/P(D); (2) output the hypothesis hMAP with the highest posterior probability.
  • 13. Bayes Theorem and Concept Learning... 13 Brute-Force MAP Learning Algorithm This algorithm may require significant computation because it applies Bayes theorem to each hypothesis in H to calculate P(h|D), which is impractical for large hypothesis spaces. The algorithm is still of interest because it provides a standard against which we may judge the performance of other concept learning algorithms. The Brute-Force MAP algorithm must specify values for P(h) and P(D|h); these can be chosen to be consistent with the following assumptions: the training data D is noise-free, the target concept c is contained in the hypothesis space H, and we have no a priori reason to believe that any hypothesis is more probable than any other.
  • 14. Bayes Theorem and Concept Learning... 14 Brute-Force MAP Learning Algorithm Given these assumptions, what values should we specify for P(h)? Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign the same prior probability to every hypothesis h in H. Furthermore, because we assume the target concept is contained in H, we should require that these prior probabilities sum to 1. Together these constraints imply that we should choose P(h) = 1/|H| for all h in H.
  • 15. Bayes Theorem and Concept Learning... 15 Brute-Force MAP Learning Algorithm What choice shall we make for P(D|h)? P(D|h) is the probability of observing the target values D = <d1, ..., dm> for the fixed set of instances <x1, x2, ..., xm>, given a world in which hypothesis h holds (i.e., h is the correct description of the target concept c). Since we assume noise-free training data, the probability of observing classification di given h is 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore, the probability of the data D given hypothesis h is 1 if D is consistent with h, and 0 otherwise.
  • 16. Bayes Theorem and Concept Learning... 16 Brute-Force MAP Learning Algorithm Consider the first step of the algorithm, which uses Bayes theorem to compute the posterior probability P(h|D) of each hypothesis h given the observed training data D. First, consider the case where h is inconsistent with the training data D. Since P(D|h) is 0 when h is inconsistent with D, Bayes theorem gives P(h|D) = (0 · P(h)) / P(D) = 0: the posterior probability of every inconsistent hypothesis is zero.
  • 17. Bayes Theorem and Concept Learning... 17 Brute-Force MAP Learning Algorithm Now consider the case where h is consistent with D. Since P(D|h) is 1 when h is consistent with D, Bayes theorem gives P(h|D) = (1 · 1/|H|) / P(D) = 1/|VSH,D|, where VSH,D, the version space of H with respect to D, is the subset of hypotheses in H that are consistent with D.
  • 18. Bayes Theorem and Concept Learning... 18 Brute-Force MAP Learning Algorithm To summarize, Bayes theorem implies that under our assumed P(h) and P(D|h), the posterior probability is P(h|D) = 1/|VSH,D| if h is consistent with D, and P(h|D) = 0 otherwise.
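A minimal sketch of the brute-force MAP learner under these assumptions (uniform prior, noise-free data). The tiny hypothesis representation and training set below are invented for illustration and are not from the slides:

```python
import itertools

# Toy hypothesis space: each hypothesis constrains two boolean attributes,
# where each constraint is True, False, or None ("don't care").
H = list(itertools.product([True, False, None], repeat=2))

def h_predicts(h, x):
    """A hypothesis classifies x positive iff every non-None constraint matches."""
    return all(c is None or c == xi for c, xi in zip(h, x))

# Noise-free training data: (instance, target classification)
D = [((True, True), True), ((True, False), True), ((False, False), False)]

prior = 1.0 / len(H)                # uniform prior P(h) = 1/|H|

def likelihood(h):                  # P(D|h) = 1 if h is consistent with D, else 0
    return 1.0 if all(h_predicts(h, x) == d for x, d in D) else 0.0

# Unnormalized posteriors P(h|D) ∝ P(D|h) P(h); every consistent h is a MAP hypothesis
posteriors = {h: likelihood(h) * prior for h in H}
version_space = [h for h, p in posteriors.items() if p > 0]
print(version_space)                # the consistent hypotheses, here [(True, None)]
```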
  • 19. Bayes Theorem and Concept Learning... 19 MAP Hypothesis and Consistent Learners Every hypothesis consistent with D is a Maximum a Posteriori (MAP) hypothesis. This translates into a statement about a general class of learners called consistent learners: a learning algorithm is a consistent learner if it outputs a hypothesis that commits zero errors over the training examples. Every consistent learner outputs a MAP hypothesis if we assume a uniform prior probability distribution over H and deterministic, noise-free training data. For example, Find-S searches the hypothesis space H from specific to general hypotheses, outputting a maximally specific consistent hypothesis; its output is therefore a MAP hypothesis under these assumptions. The Bayesian framework can thus be used to characterize the behavior of learning algorithms such as Find-S even when those algorithms do not explicitly manipulate probabilities.
  • 20. Bayes Theorem and Concept Learning... 20 Evolution of Probabilities Associated with Hypotheses In figure (a), all hypotheses have the same prior probability. In figures (b) and (c), as training data accumulates, the posterior probability of inconsistent hypotheses becomes zero, while the total probability (summing to 1) is shared equally among the remaining consistent hypotheses.
  • 21. Bayes Theorem and Concept Learning... 21 MAP Hypothesis and Consistent Learners Many learning approaches, such as neural network learning, linear regression, and polynomial curve fitting, try to learn a continuous-valued target function. The version-space analysis above does not carry over directly to continuous-valued target functions, but a Bayesian analysis is still possible: under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a MAXIMUM LIKELIHOOD HYPOTHESIS. The significance of this result is that it provides a Bayesian justification (under these assumptions) for many neural network and other curve-fitting methods that attempt to minimize the sum of squared errors over the training data.
  • 22. Maximum Likelihood and Least Squared Error Hypothesis 22 Consider the problem of learning a continuous-valued target function, as arises in neural network learning, linear regression, and polynomial curve fitting. A straightforward Bayesian analysis shows that, under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood (ML) hypothesis.
  • 23. Maximum Likelihood and Least Squared Error Hypothesis… 23 Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R ], and training examples of the form <xi, di>. The problem faced by L is to learn an unknown target function f : X → R. A set of m training examples is provided, where each example is a pair (xi, di) with di = f(xi) + ei. Here f(xi) is the noise-free value of the target function and ei is a random variable representing the noise; the ei are drawn independently and are distributed according to a Normal distribution with zero mean. The task of the learner is to output a maximum likelihood hypothesis, or equivalently a MAP hypothesis assuming all hypotheses are equally probable a priori.
  • 24. Maximum Likelihood and Least Squared Error Hypothesis… 24 A normal distribution (also known as a Gaussian, Gauss, or Laplace–Gauss distribution) is a type of continuous probability distribution for a real-valued random variable. The parameter µ is the mean or expectation of the distribution (and also its median and mode), while the parameter σ is its standard deviation; the variance of the distribution is σ². A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. The general form of its probability density function is shown below.
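The density formula referenced above appears as an image on the slide; its standard form is:

```latex
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
```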
  • 25. Maximum Likelihood and Least Squared Error Hypothesis… 25 Data can be "distributed" (spread out) in different ways, for example skewed to the left or skewed to the right.
  • 26. Maximum Likelihood and Least Squared Error Hypothesis… 26 There are many cases where the data tends to be around a central value with no bias left or right, and it gets close to a "Normal Distribution". The normal distribution, also known as the Gaussian distribution, is the most important probability distribution in statistics for independent, random variables. Many things closely follow a Normal Distribution: heights of people, sizes of things produced by machines, errors in measurements, blood pressure, and marks on a test.
  • 27. Maximum Likelihood and Least Squared Error Hypothesis… 27 To find the maximum likelihood hypothesis for a continuous-valued target function, we start with the definition of the maximum likelihood hypothesis, using lower-case p to denote the probability density function. We assume a fixed set of training instances (x1, ..., xm) and therefore consider the data D to be the corresponding sequence of target values D = (d1, ..., dm). Assuming the training examples are mutually independent given h, we can write p(D|h) as the product of the individual densities p(di|h), and then substitute the Normal density for each p(di|h).
  • 28. Maximum Likelihood and Least Squared Error Hypothesis… 28 Because the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution with variance σ² centered around the true target value f(xi). Because we are writing the expression for p(D|h), we assume h is the correct description of f, hence µ = f(xi) = h(xi). We now apply a transformation that is common in maximum likelihood calculations: rather than maximizing the complicated product, we maximize its less complicated logarithm. This is justified because ln p is a monotonic function of p, so maximizing ln p also maximizes p.
  • 29. Maximum Likelihood and Least Squared Error Hypothesis… 29 The first term is a constant independent of h, so it can be discarded. Maximizing the resulting negative quantity is equivalent to minimizing the corresponding positive quantity. Finally, we can again discard the constants that are independent of h.
  • 30. Maximum Likelihood and Least Squared Error Hypothesis… 30 The resulting equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi): hML = argmin over h in H of Σi (di − h(xi))².
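A small numerical sketch of this result: under Gaussian noise, the candidate hypothesis with the smallest sum of squared errors also has the largest log-likelihood. The data and candidate hypotheses below are invented for illustration:

```python
import math, random

random.seed(0)
true_f = lambda x: 2.0 * x + 1.0
sigma = 0.5

# Training data corrupted by zero-mean Gaussian noise: d_i = f(x_i) + e_i
xs = [i / 10 for i in range(20)]
ds = [true_f(x) + random.gauss(0.0, sigma) for x in xs]

# A few candidate linear hypotheses h(x) = w*x + b
candidates = {"h1": (1.5, 1.0), "h2": (2.0, 1.0), "h3": (2.5, 0.5)}

def sse(w, b):
    """Sum of squared errors of the hypothesis over the training data."""
    return sum((d - (w * x + b)) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w, b):
    """log p(D|h) for Gaussian noise with known variance sigma^2."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - (w * x + b)) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

for name, (w, b) in candidates.items():
    print(name, round(sse(w, b), 3), round(log_likelihood(w, b), 3))
# The hypothesis with minimum SSE is also the one with maximum log-likelihood.
```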
  • 31. Maximum Likelihood Hypothesis for Predicting Probabilities 31 Consider the setting in which we wish to learn a nondeterministic (probabilistic) function f : X → {0, 1}, which has two discrete output values. We want a function approximator whose output is the probability that f(x) = 1; in other words, we want to learn the target function f' : X → [0, 1] such that f'(x) = P(f(x) = 1). How can we learn f' using, for example, a neural network? A brute-force approach would be to first collect the observed frequencies of 1's and 0's for each possible value of x and then train the neural network to output the target frequency for each x.
  • 32. Maximum Likelihood Hypothesis for Predicting Probabilities… 32 What criterion should we optimize in order to find a maximum likelihood hypothesis for f' in this setting? First obtain an expression for P(D|h). Assume the training data D is of the form D = {(x1, d1), ..., (xm, dm)}, where di is the observed 0 or 1 value for f(xi). Treating both xi and di as random variables, and assuming that each training example is drawn independently, we can write P(D|h) as a product over the examples; applying the product rule gives P(D|h) = Πi P(di|h, xi) P(xi).
  • 33. Maximum Likelihood Hypothesis for Predicting Probabilities… 33 The probability P(di|h, xi) is h(xi) if di = 1 and (1 − h(xi)) if di = 0. This can be re-expressed in a more mathematically manipulable form as P(di|h, xi) = h(xi)^di (1 − h(xi))^(1−di). Substituting this into the expression for P(D|h) and taking logarithms, the maximum likelihood hypothesis is hML = argmax over h in H of Σi [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]. This equation, the cross-entropy criterion, describes the quantity that must be maximized in order to obtain the maximum likelihood hypothesis in our current problem setting.
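A minimal sketch of this criterion: the log-likelihood of binary targets under a probabilistic hypothesis h(x) = P(f(x) = 1) is the (negative) cross-entropy, so maximizing it selects the hypothesis whose predicted probabilities best match the observed 0/1 outcomes. The data and candidate hypotheses below are illustrative only:

```python
import math

# Training data: (x_i, d_i) with d_i in {0, 1}
D = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (4.0, 1)]

def log_likelihood(h):
    """Sum_i [ d_i ln h(x_i) + (1 - d_i) ln (1 - h(x_i)) ]."""
    return sum(d * math.log(h(x)) + (1 - d) * math.log(1 - h(x)) for x, d in D)

# Two candidate probabilistic hypotheses (sigmoids with different thresholds)
h_a = lambda x: 1 / (1 + math.exp(-(x - 1.5)))   # switches near x = 1.5
h_b = lambda x: 1 / (1 + math.exp(-(x - 3.5)))   # switches near x = 3.5

print(log_likelihood(h_a), log_likelihood(h_b))
# h_a scores higher: it assigns larger probability to the observed labels,
# so of these two candidates it is the maximum likelihood hypothesis.
```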
  • 34. Minimum Description Length Principle 34 The Minimum Description Length principle gives a Bayesian perspective on Occam's razor (the inductive bias summarized as "choose the shortest explanation for the observed data"). It is motivated by interpreting the definition of hMAP in the light of basic concepts from information theory: hMAP = argmax over h of P(D|h) P(h), which can be equivalently expressed in terms of maximizing log2 P(D|h) + log2 P(h), or alternatively minimizing the negative of this quantity: hMAP = argmin over h of [ −log2 P(D|h) − log2 P(h) ] (Equation 1).
  • 35. Minimum Description Length Principle… 35 Equation (1) can be interpreted as a statement that short hypotheses are preferred, assuming a particular representation scheme for encoding hypotheses and data. −log2 P(h) is the description length of h under the optimal encoding for the hypothesis space H: LCH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H. −log2 P(D|h) is the description length of the training data D given hypothesis h under its optimal encoding: LCD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for describing data D assuming that both the sender and receiver know the hypothesis h. Rewriting Equation (1) in these terms shows that hMAP is the hypothesis h that minimizes the sum of the description length of the hypothesis plus the description length of the data given the hypothesis: hMAP = argmin over h of [ LCH(h) + LCD|h(D|h) ], where CH and CD|h are the optimal encodings for H and for D given h.
  • 36. Minimum Description Length Principle… 36 The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths: hMDL = argmin over h of [ LC1(h) + LC2(D|h) ], where codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis, respectively. The analysis above shows that if we choose C1 to be the optimal encoding of hypotheses, CH, and C2 to be the optimal encoding CD|h, then hMDL = hMAP.
  • 37. Optimal Bayes Classifier 37 Consider a hypothesis space consisting of three hypotheses h1, h2, h3, whose posterior probabilities given the training data are 0.4, 0.3, and 0.3; h1 is the MAP hypothesis. A new instance x is encountered, which is classified positive by h1 but negative by h2 and h3. Taking all hypotheses into account, the probability that x is positive is 0.4 and the probability that it is negative is 0.6, so the most probable classification (negative) differs from the classification generated by the MAP hypothesis (positive). The most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities. For any value vj from some set V, the probability P(vj|D) that the correct classification for the new instance is vj is P(vj|D) = Σ over hi in H of P(vj|hi) P(hi|D). The optimal classification of the new instance is the value vj for which P(vj|D) is maximum.
  • 38. Optimal Bayes Classifier… 38 The Bayes optimal classification of a new instance is the value vj for which P(vj|D) is maximum: argmax over vj in V of Σ over hi in H of P(vj|hi) P(hi|D).
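A tiny sketch of the Bayes optimal classification rule applied to the three-hypothesis example above (posteriors 0.4, 0.3, 0.3; h1 votes positive, h2 and h3 vote negative); the dictionary layout is purely illustrative:

```python
# Posterior probabilities P(h_i | D) and each hypothesis's (deterministic)
# prediction probabilities P(v | h_i) for the new instance x.
hypotheses = {
    "h1": {"posterior": 0.4, "P(+|h)": 1.0, "P(-|h)": 0.0},
    "h2": {"posterior": 0.3, "P(+|h)": 0.0, "P(-|h)": 1.0},
    "h3": {"posterior": 0.3, "P(+|h)": 0.0, "P(-|h)": 1.0},
}

# P(v | D) = sum_i P(v | h_i) P(h_i | D) for each candidate value v
p_pos = sum(h["P(+|h)"] * h["posterior"] for h in hypotheses.values())  # 0.4
p_neg = sum(h["P(-|h)"] * h["posterior"] for h in hypotheses.values())  # 0.6

print("Bayes optimal classification:", "+" if p_pos > p_neg else "-")   # "-"
```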
  • 39. Gibbs Algorithm 39 Although the Bayes optimal classifier obtains the best performance that can be achieved from the given training data, it can be quite costly to apply. The expense is due to the fact that it computes the posterior probability for every hypothesis in H and then combines the predictions of each hypothesis to classify each new instance. An alternative, less optimal method is the Gibbs algorithm: (1) choose a hypothesis h from H at random, according to the posterior probability distribution over H; (2) use h to predict the classification of the next instance x. Under certain conditions, classifying the next instance according to a hypothesis drawn at random from the current version space (according to a uniform distribution) has expected error at most twice that of the Bayes optimal classifier.
  • 40. Naïve Bayes Classifier 40 The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2, ..., an). The learner is asked to predict the target value, or classification, for this new instance. The Bayesian approach to classifying the new instance is to assign the most probable target value, vMAP = argmax over vj in V of P(vj | a1, a2, ..., an), given the attribute values that describe the instance. Using Bayes theorem, this expression can be rewritten as vMAP = argmax over vj in V of P(a1, a2, ..., an | vj) P(vj).
  • 41. Naïve Bayes Classifier… 41 The naive Bayes classifier is based on the assumption that the attribute values are conditionally independent given the target value. That is, the assumption is that given the target value of the instance, the probability of observing the conjunction (a1, a2, ..., an) is just the product of the probabilities for the individual attributes: P(a1, a2, ..., an | vj) = Πi P(ai | vj). Substituting this into the expression above gives the naive Bayes classifier: vNB = argmax over vj in V of P(vj) Πi P(ai | vj), where vNB denotes the target value output by the naive Bayes classifier.
  • 42. Naïve Bayes Classifier… 42 An Illustrative Example Let us apply the naive Bayes classifier to a concept learning problem, i.e., classifying days according to whether someone will play tennis. The table below provides a set of 14 training examples of the target concept PlayTennis, where each day is described by the attributes Outlook, Temperature, Humidity, and Wind.
  • 43. Naïve Bayes Classifier… 43 An Illustrative Example Use the naive Bayes classifier and the training data from this table to classify the following novel instance: < Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong >. Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new instance.
  • 44. Naïve Bayes Classifier… 44 An Illustrative Example The probabilities of the different target values can easily be estimated from their frequencies over the 14 training examples: P(PlayTennis = yes) = 9/14 = 0.64 and P(PlayTennis = no) = 5/14 = 0.36.
  • 45. Naïve Bayes Classifier… 45 An Illustrative Example Similarly, estimate the conditional probabilities, for example those for Wind = strong: P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33 and P(Wind = strong | PlayTennis = no) = 3/5 = 0.60. Calculating vNB according to Equation (1), the naive Bayes classifier assigns the target value PlayTennis = no to this new instance, based on the probability estimates learned from the training data. By normalizing the two quantities to sum to one, we obtain the conditional probability that the target value is no, given the observed attribute values.
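A minimal sketch of this calculation in Python. The conditional probability estimates below are read off the standard 14-example PlayTennis table assumed above; the table itself is not reproduced in this transcript, so treat these counts as the usual textbook values:

```python
# Frequency-based estimates from the 14-example PlayTennis data set.
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {  # P(attribute value | PlayTennis)
    "yes": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9},
    "no":  {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5},
}

instance = ["sunny", "cool", "high", "strong"]

# v_NB = argmax_v P(v) * prod_i P(a_i | v)
scores = {}
for v in priors:
    score = priors[v]
    for a in instance:
        score *= cond[v][a]
    scores[v] = score

print(scores)                               # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))          # 'no'
print(scores["no"] / sum(scores.values()))  # ~0.795 after normalization
```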
  • 46. Naïve Bayes Classifier… 46 Estimating Probabilities So far we have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities. For example, in the above case we estimated P(Wind = strong | PlayTennis = no) by the fraction nc/n, where n = 5 is the total number of training examples for which PlayTennis = no, and nc = 3 is the number of these for which Wind = strong. When nc = 0, the estimate nc/n is zero, and this probability term will dominate the quantity calculated in Equation (2), since that quantity is obtained by multiplying all the other probability terms by this zero value. To avoid this difficulty we can adopt a Bayesian approach to estimating the probability, using the m-estimate of probability: (nc + m·p) / (n + m).
  • 47. Naïve Bayes Classifier… 47 Estimating Probabilities m-estimate of probability: (nc + m·p) / (n + m), where p is our prior estimate of the probability we wish to determine, and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data. A common method for choosing p in the absence of other information is to assume uniform priors; that is, if an attribute has k possible values we set p = 1/k.
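A small helper illustrating the m-estimate, assuming uniform priors p = 1/k; the function name and example numbers are illustrative:

```python
def m_estimate(n_c, n, p, m):
    """Bayesian m-estimate of a probability: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Example: P(Wind = strong | PlayTennis = no) with n = 5, n_c = 3.
# Wind has k = 2 values (strong, weak), so the uniform prior is p = 1/2.
print(m_estimate(3, 5, p=0.5, m=2))   # ~0.571 instead of the raw 3/5 = 0.6

# With n_c = 0 the raw estimate would be 0 and would zero out the whole
# naive Bayes product; the m-estimate keeps it strictly positive.
print(m_estimate(0, 5, p=0.5, m=2))   # ~0.143
```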
  • 48. Bayesian Belief Networks 48  The naive Bayes classifier makes significant use of the assumption that the values of the attributes a1 . . .an are conditionally independent given the target value v.  This assumption dramatically reduces the complexity of learning the target function  A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities  Bayesian belief networks allow stating conditional independence assumptions that apply to subsets of the variables
  • 49. Bayesian Belief Networks… 49 Consider an arbitrary set of random variables Y1, ..., Yn, where each variable Yi can take on the set of possible values V(Yi). The joint space of the set of variables Y is defined to be the cross product V(Y1) × V(Y2) × ... × V(Yn). In other words, each item in the joint space corresponds to one of the possible assignments of values to the tuple of variables (Y1, ..., Yn). The probability distribution over this joint space is called the joint probability distribution; it specifies the probability for each of the possible variable bindings for the tuple (Y1, ..., Yn). A Bayesian belief network describes the joint probability distribution for a set of variables.
  • 50. Bayesian Belief Networks… 50 Conditional Independence Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is, if (∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk). This expression is written in abbreviated form as P(X | Y, Z) = P(X | Z). Conditional independence can be extended to sets of variables: the set of variables X1, ..., Xl is conditionally independent of the set of variables Y1, ..., Ym given the set of variables Z1, ..., Zn if P(X1, ..., Xl | Y1, ..., Ym, Z1, ..., Zn) = P(X1, ..., Xl | Z1, ..., Zn).
  • 51. Bayesian Belief Networks… 51 Representation A Bayesian belief network represents the joint probability distribution for a set of variables. Bayesian networks (BN) are represented by directed acyclic graphs. The Bayesian network in the figure above represents the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.
  • 52. Bayesian Belief Networks… 52 Representation A Bayesian network (BN) represents the joint probability distribution by specifying a set of conditional independence assumptions, encoded in a directed acyclic graph, together with sets of local conditional probabilities. Each variable in the joint space is represented by a node in the Bayesian network. The network arcs represent the assertion that each variable is conditionally independent of its non-descendants in the network given its immediate predecessors (parents) in the network. A conditional probability table (CPT) is given for each variable, describing the probability distribution for that variable given the values of its immediate predecessors. The joint probability for any desired assignment of values (y1, ..., yn) to the tuple of network variables (Y1, ..., Yn) can be computed by the formula P(y1, ..., yn) = Πi P(yi | Parents(Yi)), where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
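A minimal sketch of this product formula for a small fragment of the Storm/BusTourGroup/Campfire network. Only the structure follows the example; the CPT numbers below are invented for illustration, since the slides' full tables are not reproduced in this transcript:

```python
# Joint probability P(y1, ..., yn) = prod_i P(yi | Parents(Yi)) for a
# three-node fragment: Storm and BusTourGroup are parents of Campfire.
P_storm = {True: 0.2, False: 0.8}       # hypothetical prior for Storm
P_bus = {True: 0.1, False: 0.9}         # hypothetical prior for BusTourGroup
P_campfire = {                          # P(Campfire = True | Storm, BusTourGroup)
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """Probability of one full assignment, using the chain of local CPTs."""
    p_c = P_campfire[(storm, bus)]
    p_c = p_c if campfire else 1.0 - p_c
    return P_storm[storm] * P_bus[bus] * p_c

# Probability of the assignment (Storm=True, BusTourGroup=True, Campfire=True)
print(joint(True, True, True))   # 0.2 * 0.1 * 0.4 = 0.008
```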
  • 53. Bayesian Belief Networks… 53 Introduction A Bayesian network is a directed graph in which each node is annotated with quantitative probability information. The full specification is: (1) each node corresponds to a random variable, which may be discrete or continuous; (2) directed links (arrows) connect pairs of nodes, and if there is an arrow from node X to node Y, X is said to be a parent of Y; (3) each node Xi has an associated conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node; and (4) the graph has no directed cycles (it is a directed acyclic graph).
  • 54. Bayesian Belief Networks… 54 Introduction The topology of the network (the set of nodes and links) specifies the conditional independence relationships that hold in the domain, in a way that will be made precise shortly. The intuitive meaning of an arrow is typically that X has a direct influence on Y, which suggests that causes should be parents of effects. Once the topology of the Bayes net is laid out, we need only specify the local probability information for each variable, in the form of a conditional distribution given its parents. The full joint distribution for all the variables is then defined by the topology and the local probability information.
  • 63. Bayesian Belief Networks… 63 Example Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire is conditionally independent of its non-descendants Lightning and Thunder, given its immediate parents Storm and BusTourGroup. This means that once we know the values of the variables Storm and BusTourGroup, the variables Lightning and Thunder provide no additional information about Campfire. One entry in the conditional probability table associated with the variable Campfire asserts, for example, that P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4.
  • 64. Bayesian Belief Networks… 64 Example Inference We may wish to use a Bayesian network to infer the value of some target variable (e.g., ForestFire) given the observed values of the other variables. Inference is straightforward if values for all of the other variables in the network are known exactly. More generally, a Bayesian network can be used to compute the probability distribution for any subset of network variables given the values or distributions for any subset of the remaining variables. However, exact inference of probabilities in an arbitrary Bayesian network is known to be NP-hard.
  • 65. Bayesian Belief Networks… 65 Learning Bayesian Belief Networks Effective algorithms for learning Bayesian belief networks from training data can be devised by considering several different settings for the learning problem. First, the network structure might be given in advance, or it might have to be inferred from the training data. Second, all the network variables might be directly observable in each training example, or some might be unobservable. In the case where the network structure is given in advance and the variables are fully observable in the training examples, learning the conditional probability tables is straightforward: we simply estimate the conditional probability table entries from the observed frequencies in the training data. In the case where the network structure is given but only some of the variable values are observable in the training data, the learning problem is more difficult; it can be compared to learning the weights of hidden units in an artificial neural network.
  • 66. EM (Expectation-Maximization) Algorithm 66 In real-world applications of machine learning it is very common that many relevant features are available for learning, but only a small subset of them are observable. The Expectation-Maximization algorithm can be used to handle latent variables (variables that are not directly observable and are instead inferred from the values of other observed variables). This algorithm is the basis for many unsupervised clustering algorithms in the field of machine learning. The essence of the Expectation-Maximization algorithm is to use the available observed data to estimate the missing data, and then to use that completed data to update the values of the parameters.
  • 67. EM (Expectation-Maximization) Algorithm 67 EM Algorithm: Step 1: Initially, a set of parameter values is chosen. A set of incomplete observed data is given to the system, with the assumption that the observed data comes from a specific model. Step 2: The Expectation step (E-step): the observed data and the current parameter estimates are used to estimate (or "guess") the values of the incomplete or missing data; it is basically used to update the latent variables of interest. Step 3: The Maximization step (M-step): the complete data generated in the E-step is used to update the values of the parameters; it is used to update the hypothesis. Step 4: Check for convergence of the values; if they have converged, stop, else repeat steps 2 and 3 until convergence.
  • 69. EM (Expectation-Maximization) Algorithm 69 Estimating the means of k Gaussians: The EM algorithm can be used for variables whose values are never directly observed, provided the general form of the probability distribution governing these variables is known.
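A compact sketch of EM for estimating the means of k = 2 one-dimensional Gaussians with known, equal variance; the synthetic data and initialization are illustrative only:

```python
import math, random

random.seed(1)
sigma = 1.0
# Observed data drawn from two hidden Gaussians (true means 0 and 5); which
# Gaussian generated each point is the unobserved (latent) variable.
data = [random.gauss(0, sigma) for _ in range(50)] + \
       [random.gauss(5, sigma) for _ in range(50)]

def gaussian(x, mu):
    """Unnormalized Gaussian density with standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

mu = [random.choice(data), random.choice(data)]   # step 1: initial parameter guesses
for _ in range(20):
    # E-step: expected responsibility of each Gaussian j for each point x_i
    resp = []
    for x in data:
        w = [gaussian(x, m) for m in mu]
        total = sum(w)
        resp.append([wj / total for wj in w])
    # M-step: re-estimate each mean as the responsibility-weighted average
    mu = [sum(r[j] * x for r, x in zip(resp, data)) / sum(r[j] for r in resp)
          for j in range(2)]

print([round(m, 2) for m in mu])   # converges close to the true means 0 and 5
```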
  • 70. EM (Expectation-Maximization) Algorithm 70 Usage of the EM Algorithm: • It can be used to fill in missing data in a sample. • It can be used as the basis of unsupervised learning of clusters. • It can be used to estimate the parameters of a Hidden Markov Model (HMM). • It can be used for discovering the values of latent variables.
  • 71. EM (Expectation-Maximization) Algorithm 71 Advantages of the EM Algorithm: The likelihood is guaranteed not to decrease with each iteration. The E-step and M-step are often easy to implement for many problems. Solutions to the M-step often exist in closed form. Disadvantages of the EM Algorithm: It has slow convergence. It may converge only to a local optimum. It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
  • 72. K-Means Algorithm 72 The K-means algorithm is an unsupervised learning algorithm. The algorithm makes very general assumptions regarding the distribution of the data. K-means attempts to detect clusters within the dataset under the optimization criterion that the sum of the within-cluster variances is minimized; hence the K-means clustering algorithm produces a minimum variance estimate (MVE) of the state of the identified clusters in the data. Although the method works well at detecting homogeneous clusters, it inevitably falls short because of the simplistic assumption about the spherical nature of the clusters inherent in its optimization function (i.e., it assumes that all the clusters have equal covariance matrices).
  • 73. K-Means Algorithm 73 Unlike K-means, in EM the clusters are not limited to spherical shapes. In EM we can constrain the algorithm to use different covariance matrices (spherical, diagonal, or generic/full). These different covariance matrices in turn allow us to control the shape of the clusters, so we can detect sub-populations in the data with different characteristics.
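As a sketch of this difference, K-means and a Gaussian mixture fitted by EM can be compared directly, assuming scikit-learn is available; the synthetic data and parameter choices below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated (non-spherical) clusters: K-means' equal, spherical
# covariance assumption distorts these, while EM with full covariances
# can adapt to the cluster shapes.
a = rng.normal([0, 0], [3.0, 0.3], size=(200, 2))
b = rng.normal([0, 3], [3.0, 0.3], size=(200, 2))
X = np.vstack([a, b])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gm_labels = GaussianMixture(n_components=2, covariance_type="full",
                            random_state=0).fit_predict(X)

print("K-means cluster sizes:", np.bincount(km_labels))
print("EM (full covariance) cluster sizes:", np.bincount(gm_labels))
```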