Lecture 2.
Bayesian Decision Theory
Bayes Decision Rule
Loss function
Decision surface
Multivariate normal and Discriminant Function
Bayes Decision
Bayes decision theory is decision making when all the underlying probability
distributions are known.
It is optimal under that assumption.
For two classes ω1 and ω2 ,
Prior probabilities for an unknown new observation:
P(ω1) : the new observation belongs to class 1
P(ω2) : the new observation belongs to class 2
P(ω1 ) + P(ω2 ) = 1
The priors reflect our prior knowledge. When no features of the new object
are available, the decision rule is:
Classify as class 1 if P(ω1) > P(ω2)
Bayes Decision
We observe features on each object.
P(x | ω1) & P(x | ω2) : the class-conditional densities
The Bayes rule:
P(ωj | x) = P(x | ωj) P(ωj) / P(x),   where P(x) = Σj P(x | ωj) P(ωj)
Bayes Decision
[Figure: the class-conditional densities P(x | ω1) and P(x | ω2), i.e. the likelihood of observing x given the class label.]
Bayes Decision
[Figure: the resulting posterior probabilities P(ω1 | x) and P(ω2 | x).]
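As an illustration (not from the slides), a minimal Python sketch of this rule for two classes; the priors, the 1-D Gaussian class-conditional densities, and the test point are assumptions made up for the example:

```python
import numpy as np
from scipy.stats import norm

# Assumed example setup: two classes with known priors and
# 1-D Gaussian class-conditional densities (illustrative values only).
priors = np.array([0.6, 0.4])                 # P(w1), P(w2)
likelihoods = [norm(loc=0.0, scale=1.0),      # P(x | w1)
               norm(loc=2.0, scale=1.0)]      # P(x | w2)

def posteriors(x):
    """Bayes rule: P(w_j | x) = P(x | w_j) P(w_j) / P(x)."""
    joint = np.array([l.pdf(x) for l in likelihoods]) * priors
    return joint / joint.sum()                # divide by the evidence P(x)

x = 1.3
post = posteriors(x)
print(post, "-> decide class", np.argmax(post) + 1)
```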
Loss function
Loss function:
maps a probability statement --> a decision.
Some classification mistakes can be more costly than others.
The set of c classes: {ω1, ω2, …, ωc}
The set of possible actions: {α1, α2, …, αa}
αi : deciding that an observation belongs to ωi
Loss when taking action αi given that the observation belongs to
hidden class ωj: λ(αi | ωj)
Loss function
The expected loss:
Given an observation with feature vector x, the conditional risk of action αi is
R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
Our final goal is to minimize the total risk over all x.
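A hedged sketch of evaluating this conditional risk in Python; the loss matrix and the posterior vector are invented for illustration:

```python
import numpy as np

# lam[i, j] = loss for taking action alpha_i when the true class is omega_j
# (illustrative values; any c x c loss matrix works the same way).
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])

def conditional_risk(posterior, lam):
    """R(alpha_i | x) = sum_j lam(alpha_i | omega_j) P(omega_j | x)."""
    return lam @ posterior

posterior = np.array([0.3, 0.7])       # assumed P(omega_1 | x), P(omega_2 | x)
risks = conditional_risk(posterior, lam)
best = np.argmin(risks)                # Bayes decision: the risk-minimizing action
print(risks, "-> take action", best + 1)
```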
Loss function
The zero-one loss:
All errors are equally costly.
The conditional risk is:
“The risk corresponding to this loss function is the average
probability of error.”
Zero-one loss:  λ(αi | ωj) = 0 if i = j; 1 if i ≠ j,   i, j = 1, …, c
Conditional risk:  R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)
Loss function
Let λij = λ(αi | ωj) denote the loss for deciding class i
when the true class is j.
In minimizing the risk, we decide class one if
λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x)
Rearranging, we decide ω1 if
(λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
Loss function
Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1);
then decide ω1 if  P(x | ω1) / P(x | ω2) > θλ
Example:
λ = [0 1; 1 0]  (zero-one loss)  ⇒  θλ = P(ω2) / P(ω1) = θa
λ = [0 2; 1 0]  ⇒  θλ = 2 P(ω2) / P(ω1) = θb
Loss function
[Figure: the likelihood ratio P(x | ω1) / P(x | ω2) plotted against x, with the decision
threshold θa for the zero-one loss and the larger threshold θb used when
misclassifying ω2 is penalized more.]
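Under the two loss matrices of the example above, a small sketch (assumed priors and class densities) that makes the decision by comparing the likelihood ratio with θλ:

```python
import numpy as np
from scipy.stats import norm

p1, p2 = 0.5, 0.5                                 # assumed priors P(w1), P(w2)
f1, f2 = norm(0.0, 1.0), norm(2.0, 1.0)           # assumed class-conditional densities

def theta(lam, p1, p2):
    """theta_lambda = (lam12 - lam22) / (lam21 - lam11) * P(w2) / P(w1)."""
    return (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * p2 / p1

lam_a = np.array([[0, 1], [1, 0]])                # zero-one loss        -> theta_a
lam_b = np.array([[0, 2], [1, 0]])                # w2 errors cost more  -> theta_b

x = 0.9
ratio = f1.pdf(x) / f2.pdf(x)                     # P(x | w1) / P(x | w2)
for name, lam in [("theta_a", lam_a), ("theta_b", lam_b)]:
    t = theta(lam, p1, p2)
    print(name, "=", t, "-> decide", "w1" if ratio > t else "w2")
```

With these numbers the same x is assigned to ω1 under θa but to ω2 under the more cautious θb.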
Discriminant function & decision surface
Features -> discriminant functions gi(x), i=1,…,c
Assign class i if gi(x) > gj(x) ∀j ≠ i
Decision surface defined by gi(x) = gj(x)
Decision surface
The discriminant functions help partition the feature space
into c decision regions (not necessarily contiguous). Our
interest is to estimate the boundaries between the regions.
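To make the partition concrete, a sketch (all densities and priors invented) that labels a grid of points by the largest discriminant gi(x) = ln P(x | ωi) + ln P(ωi):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Three assumed 2-D Gaussian classes with equal priors (illustrative only).
classes = [multivariate_normal([0, 0], np.eye(2)),
           multivariate_normal([3, 0], np.eye(2)),
           multivariate_normal([0, 3], np.eye(2))]
priors = np.array([1 / 3] * 3)

xx, yy = np.meshgrid(np.linspace(-3, 6, 200), np.linspace(-3, 6, 200))
grid = np.dstack([xx, yy])

# g_i(x) = ln P(x | w_i) + ln P(w_i); the arg-max defines the decision regions.
g = np.stack([c.logpdf(grid) + np.log(p) for c, p in zip(classes, priors)])
regions = np.argmax(g, axis=0)          # 200 x 200 array of region labels
print(np.unique(regions, return_counts=True))
```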
Minimax
Minimizing the
maximum possible
loss.
What happens when
the priors change?
Normal density
Reminder: the covariance matrix is symmetric and
positive semidefinite.
Entropy: a measure of uncertainty.
The normal distribution has the maximum entropy among all
distributions with a given mean and variance.
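For reference (a standard result, not derived on the slides), the differential entropy of a univariate normal, which is the quantity being maximized here:
H(N(μ, σ²)) = −∫ p(x) ln p(x) dx = (1/2) ln(2πe σ²)
and for a d-dimensional normal:  H(N(μ, Σ)) = (1/2) ln[(2πe)^d |Σ|]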
Reminder of some results for random vectors
Let Σ be a k×k symmetric matrix; it then has k pairs of eigenvalues and
eigenvectors, and Σ can be decomposed as:
Σ = λ1 e1 e1′ + λ2 e2 e2′ + … + λk ek ek′ = P Λ P′
Positive-definite matrix:  x′ Σ x > 0, ∀x ≠ 0,  equivalently  λ1 ≥ λ2 ≥ … ≥ λk > 0
Note:  x′ Σ x = λ1 (x′ e1)² + … + λk (x′ ek)²
Normal density
Whitening transform:
P : eigenvector matrix
Λ : diagonal eigenvalue matrix
Aw = P Λ^(−1/2)
Aw^t Σ Aw = Λ^(−1/2) P^t Σ P Λ^(−1/2) = Λ^(−1/2) P^t (P Λ P^t) P Λ^(−1/2) = I
(using Σ = λ1 e1 e1′ + … + λk ek ek′ = P Λ P^t)
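A minimal numpy sketch (the positive-definite Σ is generated at random for the example) that builds Aw = P Λ^(−1/2) and checks that Aw^t Σ Aw = I:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)          # assumed positive-definite covariance

lam, P = np.linalg.eigh(Sigma)           # eigenvalues and orthonormal eigenvectors
Aw = P @ np.diag(lam ** -0.5)            # whitening transform A_w = P Lambda^(-1/2)

I = Aw.T @ Sigma @ Aw                    # should be the identity matrix
print(np.allclose(I, np.eye(4)))         # True
```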
Normal density
To make a minimum error rate classification (zero-one loss),
we use the discriminant functions:
gi(x) = ln P(x | ωi) + ln P(ωi)
This is the log of the numerator in the Bayes formula; the log posterior
probability differs from it only by ln P(x), which is the same for every class.
Log is used because we only compare the gi's, and log is monotone.
When a normal density P(x | ωi) = N(μi, Σi) is assumed, we have:
gi(x) = −(1/2)(x − μi)^t Σi^(−1)(x − μi) − (d/2) ln 2π − (1/2) ln|Σi| + ln P(ωi)
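A sketch (class parameters invented) that writes out this normal-density discriminant directly from the formula above:

```python
import numpy as np

def g_normal(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln(2*pi)
                - 1/2 ln|Sigma| + ln P(omega_i)"""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)      # squared Mahalanobis distance
    return (-0.5 * maha - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior))

# Illustrative two-class comparison with assumed means, covariances, priors.
x = np.array([1.0, 0.5])
g1 = g_normal(x, np.array([0.0, 0.0]), np.eye(2), 0.6)
g2 = g_normal(x, np.array([2.0, 1.0]), np.eye(2), 0.4)
print("decide class", 1 if g1 > g2 else 2)
```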
Discriminant function for normal density
(1) Σi = σ²I
Linear discriminant function:
gi(x) = wi^t x + wi0,  with  wi = μi / σ²  and  wi0 = −μi^t μi / (2σ²) + ln P(ωi)
Note: the class-independent terms (marked by blue boxes on the slides) are
irrelevant and can be dropped.
Discriminant function for normal density
The decision surface is where gi(x) = gj(x), i.e. w^t (x − x0) = 0 with
w = μi − μj  and  x0 = (μi + μj)/2 − [σ² / ||μi − μj||²] ln[P(ωi)/P(ωj)] (μi − μj)
With equal priors, x0 is the midpoint between the two means.
The decision surface is a hyperplane, perpendicular to the
line between the means.
Discriminant function for normal density
“Linear machine”: decision surfaces are hyperplanes.
Discriminant function for normal density
With unequal prior probabilities, the decision boundary shifts toward the
mean of the less likely class.
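In one dimension this shift is easy to see; a small sketch with assumed means, common variance, and priors (the x0 formula is the Σi = σ²I boundary point from the previous slide):

```python
import numpy as np

def x0(mu1, mu2, sigma2, p1, p2):
    """Boundary point for Sigma_i = sigma^2 I in one dimension:
       x0 = (mu1 + mu2)/2 - sigma^2 / (mu1 - mu2) * ln(P(w1)/P(w2))"""
    return 0.5 * (mu1 + mu2) - sigma2 / (mu1 - mu2) * np.log(p1 / p2)

mu1, mu2, sigma2 = 0.0, 2.0, 1.0          # assumed class means and common variance
print(x0(mu1, mu2, sigma2, 0.5, 0.5))     # 1.0: the midpoint for equal priors
print(x0(mu1, mu2, sigma2, 0.8, 0.2))     # ~1.69: shifted toward mu2, the less likely mean
```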
Discriminant function for normal density
(2) Σi = Σ
Discriminant function for normal density
Set gi(x) = wi^t x + wi0, with wi = Σ^(−1) μi and wi0 = −(1/2) μi^t Σ^(−1) μi + ln P(ωi).
The decision boundary is: w^t (x − x0) = 0, with w = Σ^(−1)(μi − μj) and
x0 = (μi + μj)/2 − [ln(P(ωi)/P(ωj)) / ((μi − μj)^t Σ^(−1) (μi − μj))] (μi − μj)
Discriminant function for normal density
The hyperplane is
generally not
perpendicular to the
line between the
means.
Discriminant function for normal density
(3) Σi is arbitrary
The decision boundaries are hyperquadrics (hyperplanes, pairs of
hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids).
gi(x) = x^t Wi x + wi^t x + wi0
Wi = −(1/2) Σi^(−1)
wi = Σi^(−1) μi
wi0 = −(1/2) μi^t Σi^(−1) μi − (1/2) ln|Σi| + ln P(ωi)
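A direct numpy transcription of the Wi, wi, wi0 formulas above (the class parameters are invented for the example):

```python
import numpy as np

def quadratic_discriminant(mu, Sigma, prior):
    """Return (W_i, w_i, w_i0) for g_i(x) = x^t W_i x + w_i^t x + w_i0."""
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    w = Sinv @ mu
    w0 = (-0.5 * mu @ Sinv @ mu
          - 0.5 * np.linalg.slogdet(Sigma)[1]
          + np.log(prior))
    return W, w, w0

def g(x, W, w, w0):
    return x @ W @ x + w @ x + w0

# Assumed parameters: class 2 has a different covariance, so the
# resulting decision boundary is a hyperquadric rather than a hyperplane.
W1, w1, w10 = quadratic_discriminant(np.array([0., 0.]), np.eye(2), 0.5)
W2, w2, w20 = quadratic_discriminant(np.array([2., 0.]), np.diag([2., 0.5]), 0.5)
x = np.array([1.0, 1.0])
print("decide class", 1 if g(x, W1, w1, w10) > g(x, W2, w2, w20) else 2)
```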
Discriminant function for normal density
Extension to multiple classes.
Discriminant function for discrete features
Discrete features: x = [x1, x2, …, xd]^t,  xi ∈ {0, 1}
pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)
Assuming conditional independence of the features, the likelihood is:
P(x | ω1) = Π_{i=1}^{d} pi^xi (1 − pi)^(1−xi),   P(x | ω2) = Π_{i=1}^{d} qi^xi (1 − qi)^(1−xi)
Discriminant function for discrete features
The discriminant function:
The likelihood ratio:
Discriminant function for discrete features
g(x) = Σ_{i=1}^{d} wi xi + w0
wi = ln [ pi (1 − qi) / (qi (1 − pi)) ],   i = 1, …, d
w0 = Σ_{i=1}^{d} ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]
So the decision surface is again a hyperplane.
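A sketch computing the weights wi and w0 above for independent binary features; the pi, qi, and priors are made-up numbers:

```python
import numpy as np

p = np.array([0.8, 0.6, 0.3])      # assumed P(x_i = 1 | w1)
q = np.array([0.4, 0.5, 0.7])      # assumed P(x_i = 1 | w2)
P1, P2 = 0.5, 0.5                  # assumed priors

w = np.log(p * (1 - q) / (q * (1 - p)))                   # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)  # bias w_0

x = np.array([1, 0, 1])            # an observed binary feature vector
g = w @ x + w0                     # decide w1 if g(x) > 0: a linear (hyperplane) rule
print("decide", "w1" if g > 0 else "w2")
```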
Optimality
Consider a two-class case.
Two ways to make a mistake in the classification:
Misclassifying an observation from class 2 as class 1;
Misclassifying an observation from class 1 as class 2.
The feature space is partitioned into two regions by any
classifier: R1 and R2
Optimality
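The two kinds of mistake combine into the overall error probability that the Bayes rule minimizes (the standard expression, filled in here because the corresponding slide shows only the figure):
P(error) = P(x ∈ R1, ω2) + P(x ∈ R2, ω1)
         = ∫_{R1} p(x | ω2) P(ω2) dx + ∫_{R2} p(x | ω1) P(ω1) dx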
Optimality
In the multi-class case, there are numerous ways to make
mistakes. It is easier to calculate the probability of correct
classification.
The Bayes classifier maximizes P(correct); any other partitioning
yields an error probability that is at least as high.
The result is not dependent on the form of the underlying
distributions.