1.
Machine Learning
Central Problem of Pattern Recognition:
Supervised and Unsupervised Learning
Classification
Bayesian Decision Theory
Perceptrons and SVMs
Clustering
2.
Machine Learning – What is the Challenge?
Find optimal structure in data and validate it!
Concept for Robust Data Analysis
[Flow diagram: Data (vectors, relations, images, ...) → Structure definition (costs, risk, ...) → Structure optimization (stochastic approximation) → Validation (multiscale analysis, statistical learning theory); regularization via quantization of the solution space (information / rate distortion theory); feedback of statistical & computational complexity.]
3.
The Problem of Pattern Recognition
Machine Learning (as statistics) addresses a number of challenging inference problems in pattern recognition which span the range from statistical modeling to efficient algorithmics. Approximate methods which yield good performance on average are particularly important.
• Representation of objects ⇒ data representation
• What is a pattern? Definition/modeling of structure
• Optimization: search for preferred structures
• Validation: are the structures indeed in the data, or are they explained by fluctuations?
4.
Literature
• Richard O. Duda, Peter E. Hart & David G. Stork, Pattern Classification. Wiley & Sons (2001)
• Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag (2001)
• Luc Devroye, László Györfi & Gábor Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Verlag (1996)
• Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag (1983); The Nature of Statistical Learning Theory. Springer Verlag (1995)
• Larry Wasserman, All of Statistics. Springer Verlag (2004; corr. 2nd printing, ISBN 0-387-40272-1)
5.
The Classification Problem
6.
7.
Classification as a Pattern Recognition Problem
Problem: We look for a partition of the object space O (fish in the previous example) which corresponds to the classification examples.
Distinguish conceptually between “objects” o ∈ O and “data” x ∈ X!
Data: pairs of feature vectors and class labels
$Z = \{(x_i, y_i) : 1 \le i \le n,\; x_i \in \mathbb{R}^d,\; y_i \in \{1, \dots, k\}\}$
Definitions: feature space $\mathcal{X}$ with $x_i \in \mathcal{X} \subset \mathbb{R}^d$; class labels $y_i \in \{1, \dots, k\}$
Classifier: mapping $c : \mathcal{X} \to \{1, \dots, k\}$
k-class problem: what is $y_{n+1} \in \{1, \dots, k\}$ for $x_{n+1} \in \mathbb{R}^d$?
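The definitions above fit in a few lines of code. A minimal sketch (not from the slides; the nearest-class-mean rule and all names are our own choices) of labeled data Z and a classifier c : X → {1, . . . , k}:

```python
# Illustrative sketch: training data Z = {(x_i, y_i)} and a classifier
# c: X -> {1, ..., k} realized as a nearest-class-mean rule.
import numpy as np

def fit_class_means(X, y, k):
    """Estimate one mean vector per class from labeled data X of shape (n, d)."""
    return np.stack([X[y == c].mean(axis=0) for c in range(1, k + 1)])

def classify(x, means):
    """c(x): assign x to the class whose mean is closest (labels 1..k)."""
    return int(np.argmin(np.linalg.norm(means - x, axis=1))) + 1

# Toy 2-class data in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
means = fit_class_means(X, y, k=2)
print(classify(np.array([2.5, 2.5]), means))  # -> most likely class 2
```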
8.
Example of Classification
9.
Histograms of Length Values
[Figure 1.2 (Duda, Hart & Stork): histograms of the length feature (counts vs. length) for the two categories, salmon and sea bass, with a marked threshold l*. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors.]
10.
Histograms of Skin Brightness Values
[Figure 1.3 (Duda, Hart & Stork): histograms of the lightness feature (counts vs. lightness) for salmon and sea bass, with a threshold x* as decision boundary. No single threshold value x* will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors.]
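A single feature with a threshold, as in the two histograms above, already defines a classifier. A minimal sketch, with synthetic stand-ins for the salmon/sea-bass lightness values, that picks the threshold x* with the fewest empirical errors:

```python
# Scan candidate thresholds x* on one feature and keep the one with the
# fewest training errors. The data below are made-up stand-ins.
import numpy as np

rng = np.random.default_rng(1)
salmon = rng.normal(4.0, 1.0, 200)    # hypothetical lightness values
seabass = rng.normal(6.5, 1.0, 200)

x = np.concatenate([salmon, seabass])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 0 = salmon, 1 = sea bass

def errors(threshold):
    """Count mistakes of the rule 'sea bass iff lightness > threshold'."""
    return np.sum((x > threshold) != y)

candidates = np.linspace(x.min(), x.max(), 500)
x_star = candidates[np.argmin([errors(t) for t in candidates])]
print(f"x* = {x_star:.2f}, training errors = {errors(x_star)}")
```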
11.
Linear Classification
[Figure 1.4 (Duda, Hart & Stork): scatter plot of the two features lightness and width for sea bass and salmon. A straight line could serve as a decision boundary of our classifier, with some overall classification error remaining.]
12.
Overfitting
[Figure 1.5 (Duda, Hart & Stork): overly complex models for the fish lead to complicated decision boundaries. While such a decision boundary may classify the training samples perfectly, it would lead to poor performance on future patterns, such as the novel test point marked “?”.]
13.
Optimized Nonlinear Classification
[Figure 1.6 (Duda, Hart & Stork): Occam's razor argument: Entia non sunt multiplicanda praeter necessitatem! The decision boundary shown might represent the optimal trade-off between performance on the training set and simplicity of the classifier, thereby giving the highest accuracy on new patterns.]
14.
Regression
(see Introduction to Machine Learning)
Question: Given a feature (vector) $x_i$ and a corresponding noisy measurement of a function value $y_i = f(x_i) + \text{noise}$, what is the unknown function f(·) in a hypothesis class H?
Data: $Z = \{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R} : 1 \le i \le n\}$
Modeling choice: What is an adequate hypothesis class and a good noise model? Fitting with linear/nonlinear functions?
15.
The Regression Function
Questions: (i) What is the statistically optimal estimate of a function $f : \mathbb{R}^d \to \mathbb{R}$, and (ii) which algorithm achieves this goal most efficiently?
Solution to (i): the regression function
$$y(x) = \mathbb{E}\{y \mid X = x\} = \int_\Omega y\, p(y \mid X = x)\, dy$$
[Figure: nonlinear regression of a sinc function, sinc(x) := sin(x)/x (gray), with a regression fit (black) based on 50 noisy data points.]
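The regression function can be estimated directly from samples. The slide does not say which fitting method produced the black curve, so the following sketch uses a Nadaraya-Watson kernel smoother (the Gaussian kernel and bandwidth h = 1 are our assumptions) on 50 noisy sinc samples:

```python
# Kernel estimate of the regression function y(x) = E[y | X = x].
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(-10, 10, 50)
# np.sinc(t) = sin(pi t)/(pi t), so sin(x)/x = np.sinc(x / pi)
y_train = np.sinc(x_train / np.pi) + rng.normal(0, 0.1, 50)

def nw_estimate(x, h=1.0):
    """Kernel-weighted average of the y_i: an estimate of E[y | X = x]."""
    w = np.exp(-0.5 * ((x - x_train) / h) ** 2)
    return np.sum(w * y_train) / np.sum(w)

for x0 in (0.0, 2.0, 5.0):
    print(f"f_hat({x0}) = {nw_estimate(x0):+.3f}, sinc({x0}) = {np.sinc(x0/np.pi):+.3f}")
```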
16.
Examples of linear and nonlinear regression
[Figure, two panels: linear regression (left) and nonlinear regression (right).]
How should we measure the deviations?
[Figure, two panels: vertical offsets (left) vs. perpendicular offsets (right).]
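A sketch contrasting the two deviation measures on synthetic data: ordinary least squares minimizes vertical offsets, while a total-least-squares fit via the SVD minimizes perpendicular offsets.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 100) + rng.normal(0, 0.5, 100)   # noise in x too
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)

# Vertical offsets: ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])
slope_ols, intercept_ols = np.linalg.lstsq(A, y, rcond=None)[0]

# Perpendicular offsets: total least squares = direction of largest
# variance of the centered points (first right-singular vector).
pts = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(pts, full_matrices=False)
dx, dy = vt[0]
slope_tls = dy / dx
intercept_tls = y.mean() - slope_tls * x.mean()

print(f"OLS: y = {slope_ols:.3f} x + {intercept_ols:.3f}")
print(f"TLS: y = {slope_tls:.3f} x + {intercept_tls:.3f}")
```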
17.
Core Questions of Pattern Recognition:
Unsupervised Learning
No teacher signal is available for the learning algorithm; learning is guided by a general cost/risk function.
Examples of unsupervised learning:
1. Data clustering, vector quantization: as in classification we search for a partitioning of the objects into groups, but explicit labelings are not available.
2. Hierarchical data analysis: search for tree structures in data.
3. Visualisation, dimension reduction.
Semi-supervised learning: some of the data are labeled, most of them are unlabeled.
18.
Modes of Learning
Reinforcement Learning: weakly supervised learning; action chains are evaluated only at the end.
Example: Backgammon; the neural network TD-Gammon gained the world championship! Quite popular in robotics.
Active Learning: data are selected according to their expected information gain.
Example: information filtering.
Inductive Learning: the learning algorithm extracts logical rules from the data.
Inductive Logic Programming is a popular subarea of Artificial Intelligence.
19.
Vectorial Data
[Figure: scatter plot of data from 20 Gaussian sources in $\mathbb{R}^{20}$ (classes labeled A–T), projected onto two dimensions with Principal Component Analysis.]
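A sketch of how such a projection is computed (the slide's actual data are unavailable, so the 20 Gaussian sources are simulated): center the data and project onto the two leading right-singular vectors.

```python
# PCA projection of high-dimensional Gaussian sources onto 2 dimensions.
import numpy as np

rng = np.random.default_rng(4)
k, d, n_per = 20, 20, 15
means = rng.normal(0, 3, (k, d))                 # one Gaussian source per class
X = np.vstack([rng.normal(m, 1.0, (n_per, d)) for m in means])

Xc = X - X.mean(axis=0)                          # center the data
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ vt[:2].T                             # coordinates in the top-2 PC plane
print(proj.shape)                                # (300, 2): points ready to plot
```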
20.
Relational Data
[Figure: matrix of pairwise dissimilarities of 145 globins selected from 4 classes: α-globins, β-globins, myoglobins, and globins of insects and plants.]
21.
Scales for Data
Nominal or categorical scale: qualitative, without quantitative measurements; e.g. a binary scale F = {0, 1} (presence or absence of properties like “kosher”), or the taste categories “sweet, sour, salty and bitter”.
Ordinal scale: measurement values are meaningful only with respect to other measurements, i.e., the rank order of the measurements carries the information, not the numerical differences (e.g. information on the ranking of different marathon races!?).
22.
Quantitative scale:
• Interval scale: the relation of numerical differences carries the information; invariance w.r.t. translation and scaling (Fahrenheit scale of temperature).
• Ratio scale: the zero value of the scale carries information, but not the measurement unit (Kelvin scale).
• Absolute scale: absolute values are meaningful (grades of final exams).
23.
Machine Learning: Topic Chart
• Core problems of pattern recognition
• Bayesian decision theory
• Perceptrons and Support vector machines
• Data clustering
24.
Bayesian Decision Theory
The Problem of Statistical Decisions
Task: n objects have to be partitioned into the classes 1, . . . , k, the doubt class D and the outlier class O.
D: doubt class (→ new measurements required)
O: outlier class, definitely none of the classes 1, 2, . . . , k
Objects are characterized by feature vectors X ∈ X, X ∼ P(X), with the probability P(X = x) of feature values x.
Statistical modeling: objects represented by data X and classes Y are considered to be random variables, i.e., (X, Y) ∼ P(X, Y).
Conceptually, it is not mandatory to consider class labels as random, since they might be induced by legal considerations or conventions.
25.
Structure of the feature space X
• X ⊂ R^d
• X = X₁ × X₂ × · · · × X_d with X_i ⊆ R or X_i finite.
Remark: in most situations we can define the feature space as a subset of R^d or as tuples of real, categorical (B = {0, 1}) or ordinal numbers. Sometimes we have more complicated data spaces composed of lists, trees or graphs.
Class density / likelihood: $p_y(x) := P(X = x \mid Y = y)$ is the probability of a feature value x given a class y.
Parametric statistics: estimate the parameters of the class densities $p_y(x)$.
Non-parametric statistics: minimize the empirical risk.
26.
Motivation of Classification
Given are labeled data $Z = \{(x_i, y_i) : i \le n\}$.
Questions:
1. What are the class boundaries?
2. What are the class-specific densities $p_y(x)$?
3. How many modes or parameters do we need to model $p_y(x)$?
4. ...
[Figure: quadratic SVM classifier for five classes; white areas are ambiguous regions.]
27.
Thomas Bayes and his Terminology
The State of Nature is modelled as a random variable!
prior: P{model}
likelihood: P{data | model}
posterior: P{model | data}
evidence: P{data}
Bayes Rule: $P\{\text{model} \mid \text{data}\} = \dfrac{P\{\text{data} \mid \text{model}\}\, P\{\text{model}\}}{P\{\text{data}\}}$
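A numeric illustration of Bayes' rule with made-up numbers, the evidence P{data} appearing as the normalizer:

```python
# Prior and likelihood are assumed example values; the posterior follows
# from Bayes' rule with the evidence as normalizing constant.
prior = {"model_A": 0.7, "model_B": 0.3}
likelihood = {"model_A": 0.2, "model_B": 0.6}   # P(data | model), assumed

evidence = sum(prior[m] * likelihood[m] for m in prior)            # P(data)
posterior = {m: prior[m] * likelihood[m] / evidence for m in prior}
print(posterior)   # {'model_A': 0.4375, 'model_B': 0.5625}
```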
28.
Ronald A. Fisher and Frequentism
Fisher, Ronald Aylmer (1890–1962): founder of frequentist statistics, together with Jerzy Neyman & Karl Pearson. British mathematician and biologist who invented revolutionary techniques for applying statistics to the natural sciences.
• Maximum likelihood method
• Fisher information: a measure of the information content of densities
• Sampling theory
• Hypothesis testing
29.
Bayesianism vs. Frequentist Inference¹
Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of statements, or to the degree of belief of rational agents in the truth of statements; together with Bayes' theorem, it becomes Bayesian inference. The Bayesian interpretation of probability allows probabilities to be assigned to random events, but also allows the assignment of probabilities to any other kind of statement. Bayesians assign probabilities to any statement, even when no random process is involved, as a way to represent its plausibility. As such, the scope of Bayesian inquiries includes the scope of frequentist inquiries.
The limiting relative frequency of an event over a long series of trials is the conceptual foundation of the frequency interpretation of probability. Frequentism rejects degree-of-belief interpretations of mathematical probability, as in Bayesianism, and assigns probabilities only to random events, according to their relative frequencies of occurrence.
¹ see http://encyclopedia.thefreedictionary.com/
30.
Bayes Rule for Known Densities and Parameters
Assume that we know how the features are distributed for the different classes, i.e., the class-conditional densities and their parameters are known. What is the best classification strategy in this situation?
Classifier: $\hat{c} : \mathcal{X} \to \{1, \dots, k, D\}$
The assignment function $\hat{c}$ maps the feature space $\mathcal{X}$ to the set of classes {1, . . . , k, D}. (Outliers are neglected.)
Quality of a classifier: whenever a classifier returns a label which differs from the correct class Y = y, it has made a mistake.
31.
Error count: the sum of indicator functions $\sum_{x \in \mathcal{X}} I_{\{\hat{c}(x) \neq y\}}$ counts the classifier mistakes. Note that this error count is a random variable!
Expected errors, also called the expected risk, define the quality of a classifier:
$$R(\hat{c}) = \sum_{y \le k} P(y)\, \mathbb{E}_{P(x)}\big[ I_{\{\hat{c}(x) \neq y\}} \mid Y = y \big] + \text{terms from } D$$
Remark: the rationale behind this choice comes from gambling. If we bet on a particular outcome of our experiment, and our gain is measured by how often we assign the measurements to the correct class, then a classifier with minimal expected risk will win on average against any other classification rule (“Dutch books”)!
32.
The Loss Function
Weighted mistakes are introduced when classification errors are not equally costly; e.g. in medical diagnosis, some disease classes might be harmless and others might be lethal despite similar symptoms.
⇒ We introduce a loss function L(y, z) which denotes the loss for the decision z if class y is correct.
0-1 loss: all classes are treated the same!
$$L_{0-1}(y, z) = \begin{cases} 0 & \text{if } z = y \text{ (correct decision)} \\ 1 & \text{if } z \neq y \text{ and } z \neq D \text{ (wrong decision)} \\ d & \text{if } z = D \text{ (no decision)} \end{cases}$$
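The 0-1 loss with doubt option transcribes directly into code; the doubt cost d = 0.3 below is an arbitrary example value.

```python
# 0-1 loss with doubt: the doubt class is encoded as the string "D".
def loss_01(y, z, d=0.3):
    """L(y, z): 0 for a correct decision, 1 for a wrong one, d for doubt."""
    if z == "D":
        return d          # no decision
    return 0.0 if z == y else 1.0

print(loss_01(1, 1), loss_01(1, 2), loss_01(1, "D"))  # 0.0 1.0 0.3
```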
33.
• Weighted classification costs L(y, z) ∈ R₊ are frequently used, e.g. in medicine; classification costs can also be asymmetric, that means L(y, z) ≠ L(z, y) (e.g. (z, y) ∼ (pancreas cancer, gastritis)).
Conditional risk of the classifier: the expected loss given class y,
$$R(\hat{c}, y) = \mathbb{E}_x[L(y, \hat{c}(x)) \mid Y = y] = \sum_{z \le k} L(y, z)\, P\{\hat{c}(x) = z \mid Y = y\} + L(y, D)\, P\{\hat{c}(x) = D \mid Y = y\}$$
For the 0-1 loss this reduces to
$$R(\hat{c}, y) = \underbrace{P\{\hat{c}(x) \neq y \wedge \hat{c}(x) \neq D \mid Y = y\}}_{p_{mc}(y):\ \text{probability of misclassification}} + d \cdot \underbrace{P\{\hat{c}(x) = D \mid Y = y\}}_{p_d(y):\ \text{probability of doubt}}$$
34.
Total risk of the classifier (with $\pi_y := P(Y = y)$):
$$R(\hat{c}) = \sum_{z \le k} \pi_z\, p_{mc}(z) + d \sum_{z \le k} \pi_z\, p_d(z) = \mathbb{E}_C\, R(\hat{c}, C)$$
Asymptotic average loss:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{j \le n} L(c_j, \hat{c}(x_j)) = \lim_{n \to \infty} \hat{R}(\hat{c}) = R(\hat{c}),$$
where $\{(x_j, c_j) : 1 \le j \le n\}$ is a random sample set of size n.
This formula can be interpreted as the expected loss with the empirical distribution as probability model.
35.
Posterior class probability
Posterior: let
$$p(y \mid x) \equiv P\{Y = y \mid X = x\} = \frac{\pi_y\, p_y(x)}{\sum_z \pi_z\, p_z(x)}$$
be the posterior of the class y given X = x.
(The “partition of one” $\pi_y p_y(x) / \sum_z \pi_z p_z(x)$ results from the normalization $\sum_z p(z \mid x) = 1$.)
Likelihood: the class-conditional density $p_y(x)$ is the probability of observing data X = x given class Y = y.
Prior: $\pi_y$ is the probability of class Y = y.
36.
Bayes Optimal Classifier
Theorem 1. The classification rule which minimizes the total risk for the 0-1 loss is
$$\hat{c}(x) = \begin{cases} y & \text{if } p(y \mid x) = \max_{z \le k} p(z \mid x) > 1 - d, \\ D & \text{if } p(y \mid x) \le 1 - d \;\; \forall y. \end{cases}$$
Generalization to arbitrary loss functions:
$$\hat{c}(x) = \begin{cases} y & \text{if } \sum_z L(z, y)\, p(z \mid x) = \min_{\rho \le k} \sum_z L(z, \rho)\, p(z \mid x) \le d, \\ D & \text{else.} \end{cases}$$
Bayes classifier: select the class with the highest $\pi_y p_y(x)$ value if it exceeds the cost for not making a decision, i.e., $\pi_y p_y(x) > (1 - d)\, p(x)$.
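Theorem 1 can be sketched in a few lines: compute the posteriors p(y | x) and return the maximizing class only when its posterior exceeds 1 − d, otherwise return the doubt class D. The two Gaussian class densities and d = 0.2 below are illustrative assumptions.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([0.5, 0.5])
params = [(0.0, 1.0), (3.0, 1.0)]        # (mu, sigma) per class, assumed

def bayes_classify(x, d=0.2):
    lik = np.array([gauss_pdf(x, m, s) for m, s in params])
    posterior = priors * lik / np.sum(priors * lik)   # p(y | x)
    best = int(np.argmax(posterior))
    return best + 1 if posterior[best] > 1 - d else "D"  # classes 1..k or doubt

for x in (-1.0, 1.5, 4.0):
    print(x, bayes_classify(x))          # class 1, doubt D, class 2
```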
37.
Proof: Calculate the total expected loss $R(\hat{c})$:
$$R(\hat{c}) = \mathbb{E}_X\big[\mathbb{E}_Y[L_{0-1}(Y, \hat{c}(x)) \mid X = x]\big] = \int_{\mathcal{X}} \mathbb{E}_Y[L_{0-1}(Y, \hat{c}(x)) \mid X = x]\; p(x)\, dx \quad \text{with } p(x) = \sum_{z \le k} \pi_z p_z(x)$$
Minimize the conditional expectation value, since it depends only on $\hat{c}$:
$$\begin{aligned} \hat{c}(x) &= \operatorname*{argmin}_{\tilde{c} \in \{1, \dots, k, D\}} \mathbb{E}\big[L_{0-1}(Y, \tilde{c}) \mid X = x\big] = \operatorname*{argmin}_{\tilde{c} \in \{1, \dots, k, D\}} \sum_{z \le k} L_{0-1}(z, \tilde{c})\, p(z \mid x) \\ &= \begin{cases} \operatorname*{argmin}_{\tilde{c} \in \{1, \dots, k\}} \big(1 - p(\tilde{c} \mid x)\big) & \text{if } d > \min_c \big(1 - p(c \mid x)\big) \\ D & \text{else} \end{cases} \\ &= \begin{cases} \operatorname*{argmax}_{\tilde{c} \in \{1, \dots, k\}} p(\tilde{c} \mid x) & \text{if } 1 - d < \max_c p(c \mid x) \\ D & \text{else} \end{cases} \end{aligned}$$
38.
Outliers
• Modeling by an outlier class $\pi_O$ with density $p_O(x)$.
• “Novelty detection”: classify a measurement as an outlier if
$$\pi_O\, p_O(x) \ge \max\Big\{ (1 - d)\, p(x),\ \max_z \pi_z\, p_z(x) \Big\}$$
• The outlier concept causes conceptual problems and does not fit into statistical decision theory, since outliers indicate an erroneous or incomplete specification of the statistical model!
• The outlier class is often modeled by a uniform distribution.
Attention: a normalized uniform distribution does not exist on many feature spaces!
⇒ Limit the support of the measurement space or put a (Gaussian) measure on it!
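The novelty-detection rule above can be sketched the same way; the uniform outlier density on a bounded support follows the last bullet, and all constants below are assumptions.

```python
# Flag x as an outlier when pi_O p_O(x) dominates both the doubt term
# (1 - d) p(x) and every class term pi_z p_z(x).
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([0.45, 0.45])
params = [(0.0, 1.0), (3.0, 1.0)]
pi_out, p_out = 0.10, 1.0 / 20.0         # uniform outlier density on [-10, 10]

def is_outlier(x, d=0.2):
    class_terms = priors * np.array([gauss_pdf(x, m, s) for m, s in params])
    p_x = class_terms.sum() + pi_out * p_out          # total evidence p(x)
    return pi_out * p_out >= max((1 - d) * p_x, class_terms.max())

print(is_outlier(1.5), is_outlier(8.0))  # False (in-distribution), True (far out)
```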
39.
Class-Conditional Densities and Posteriors for 2 Classes
[Left, Figure 2.1 (Duda, Hart & Stork): hypothetical class-conditional probability density functions $p(x \mid \omega_i)$, showing the probability density of measuring a particular feature value x given that the pattern is in category $\omega_i$. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. The density functions are normalized, so the area under each curve is 1.0.]
[Right, Figure 2.2 (Duda, Hart & Stork): posterior probabilities $P(\omega_i \mid x)$ for the priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$ and the class-conditional densities of Fig. 2.1. Given that a pattern is measured to have feature value x = 14, the probability that it is in $\omega_2$ is roughly 0.08, and that it is in $\omega_1$ is 0.92. At every x the posteriors sum to 1.0.]
40.
Likelihood Ratio for the 2-Class Example
[Figure 2.3 (Duda, Hart & Stork): the likelihood ratio $p(x \mid \omega_1)/p(x \mid \omega_2)$ for the distributions shown in Fig. 2.1, with thresholds $\theta_a$, $\theta_b$ and the induced decision regions $R_1$, $R_2$. With a zero-one classification loss, the decision boundaries are determined by a threshold on this ratio.]
41.
Discriminant Functions
[Figure 2.5 (Duda, Hart & Stork): the functional structure of a general statistical pattern classifier with d inputs $x_1, \dots, x_d$ and c discriminant functions $g_1(x), \dots, g_c(x)$ feeding a subsequent action step (e.g., classification with costs). A subsequent step determines which of the discriminant values is the maximum and categorizes the input pattern accordingly; the arrows show the direction of the flow of information.]
• Discriminant function: $g_y(x) = P\{Y = y \mid X = x\}$
• Class decision: $g_y(x) > g_z(x)\ \forall z \neq y \Rightarrow$ class y.
• Different discriminant functions can yield the same decision, e.g. $\tilde{g}_y(x) = \log P\{x \mid y\} + \log \pi_y$; this can minimize implementation problems!
42.
Example for Discriminant Functions
[Figure 2.6 (Duda, Hart & Stork): a two-dimensional two-category classifier with Gaussian densities $p(x \mid \omega_i) P(\omega_i)$; the decision boundary consists of two hyperbolas, so the decision region $R_2$ is not simply connected.]
43.
Adaptation of Discriminant Functions
[Figure: the classifier network of Fig. 2.5 (inputs $x_1, \dots, x_d$, discriminant functions $g_1(x), \dots, g_c(x)$, MAX unit, action) extended by a teacher signal.]
The red connections (weights) are adapted in such a way that the teacher signal is imitated by the discriminant function.
44.
Example Discriminant Functions: Normal Distributions
The likelihood of class y is Gaussian distributed:
$$p_y(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_y|}} \exp\Big( -\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) \Big)$$
Special case: $\Sigma_y = \sigma^2 I$
$$g_y(x) = \log p_y(x) + \log \pi_y = -\frac{1}{2\sigma^2} \|x - \mu_y\|^2 + \log \pi_y + \text{const.}$$
45.
⇒ Decision surface between classes z and y:
$$-\frac{1}{2\sigma^2}\|x - \mu_z\|^2 + \log \pi_z = -\frac{1}{2\sigma^2}\|x - \mu_y\|^2 + \log \pi_y$$
$$-\|x\|^2 + 2x \cdot \mu_z - \|\mu_z\|^2 + 2\sigma^2 \log \pi_z = -\|x\|^2 + 2x \cdot \mu_y - \|\mu_y\|^2 + 2\sigma^2 \log \pi_y$$
$$\Rightarrow\; 2x \cdot (\mu_z - \mu_y) - \|\mu_z\|^2 + \|\mu_y\|^2 + 2\sigma^2 \log \frac{\pi_z}{\pi_y} = 0$$
Linear decision rule: $w^T(x - x_0) = 0$ with
$$w = \mu_z - \mu_y, \qquad x_0 = \frac{1}{2}(\mu_z + \mu_y) - \frac{\sigma^2 (\mu_z - \mu_y)}{\|\mu_z - \mu_y\|^2} \log \frac{\pi_z}{\pi_y}$$
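A sketch that evaluates this linear decision rule for two spherical Gaussian classes; the means, variance and priors below are example values.

```python
# Linear decision rule w^T (x - x_0) = 0 for Sigma = sigma^2 I.
import numpy as np

mu_z = np.array([0.0, 0.0])
mu_y = np.array([3.0, 1.0])
sigma2 = 1.0
pi_z, pi_y = 0.7, 0.3

w = mu_z - mu_y
x0 = 0.5 * (mu_z + mu_y) - sigma2 * w / np.dot(w, w) * np.log(pi_z / pi_y)

def decide(x):
    """Positive side of the hyperplane -> class z, negative side -> class y."""
    return "z" if np.dot(w, x - x0) > 0 else "y"

print(x0, decide(np.array([1.0, 0.0])), decide(np.array([2.5, 1.0])))  # z, y
```

Note how the unequal priors shift $x_0$ away from the midpoint of the means, toward the less probable class.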
46.
Decision Surfaces for Gaussians in 1, 2, 3 Dimensions
[Figure 2.10 (Duda, Hart & Stork): if the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. The one-, two-, and three-dimensional examples indicate $p(x \mid \omega_i)$ and the boundaries for the case $P(\omega_1) = P(\omega_2) = 0.5$; in the three-dimensional case, the grid plane separates $R_1$ from $R_2$.]
47.
[Figure 2.11 (Duda, Hart & Stork): as the priors are changed, the decision boundary shifts; for sufficiently disparate priors (e.g. $P(\omega_1) = 0.7$, 0.8, 0.9 or 0.99) the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions.]
48.
Multi-Class Case
Decision regions for four Gaussian distributions: even for such a small number of classes, the discriminant functions show a complex form.
[Figure 2.16 (Duda, Hart & Stork): the decision regions $R_1, \dots, R_4$ for four normal distributions; even with such a small number of categories, the shapes of the boundary regions can be rather complex.]
49.
Example: Gene Expression Data
The expression of genes is measured for various patients. The expression profiles provide information about the metabolic state of the cells, meaning that they can be used as indicators for disease classes. Each patient is represented as a vector in a high-dimensional (≈ 10000) space with Gaussian class distribution.
[Figure: gene expression matrix (genes × samples) with leukemia class labels ALL B-cell, ALL T-cell and AML; predicted vs. true class assignments are indicated.]
50.
Parametric Models for Class Densities
If we knew the prior probabilities and the class-conditional probabilities, then we could calculate the optimal classifier. But we don't!
Task: estimate $p(y \mid x; \theta)$ from samples $Z = \{(x_1, y_1), \dots, (x_n, y_n)\}$ for classification.
Data are sorted according to their classes:
$$X_y = \{X_{1y}, \dots, X_{n_y, y}\} \quad \text{where } X_{iy} \sim P\{X \mid Y = y; \theta_y\}$$
Question: how can we use the information in the samples to estimate $\theta_y$?
Assumption: classes can be separated and treated independently! $X_y$ is not informative w.r.t. $\theta_z$, $z \neq y$.
51.
Maximum Likelihood Estimation Theory
Likelihood of the data set: $P\{X_y \mid \theta_y\} = \prod_{i \le n_y} p(x_{iy} \mid \theta_y)$
Estimation principle: select the parameters $\hat{\theta}_y$ which maximize the likelihood, that means
$$\hat{\theta}_y = \arg\max_{\theta_y} P\{X_y \mid \theta_y\}$$
Procedure: find the extreme value of the log-likelihood function:
$$\nabla_{\theta_y} \log P\{X \mid \theta_y\} = 0 \quad\Longleftrightarrow\quad \sum_{i \le n} \frac{\partial}{\partial \theta_y} \log p(x_i \mid \theta_y) = 0$$
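A sketch of this procedure for a one-dimensional Gaussian class: minimize the negative log-likelihood numerically (here with scipy.optimize.minimize, our choice of optimizer) and compare with the closed-form estimators derived on the following slides.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.5, 500)            # samples of one class

def neg_log_lik(theta):
    mu, log_sigma = theta                # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    # negative Gaussian log-likelihood up to an additive constant
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                 # numerical ML estimates
print(x.mean(), x.std())                 # closed-form ML estimates agree
```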
52.
Remark
Bias of an estimator: $\mathrm{bias}(\hat{\theta}_n) = \mathbb{E}\{\hat{\theta}_n\} - \theta$.
Consistent estimator: a point estimator $\hat{\theta}_n$ of a parameter $\theta$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$.
Asymptotic normality of maximum likelihood estimates: $(\hat{\theta}_n - \theta)/\sqrt{\mathbb{V}\{\hat{\theta}_n\}} \rightsquigarrow \mathcal{N}(0, 1)$.
Alternative to ML class density estimation: discriminative learning by maximizing the a posteriori distribution $P\{\theta_y \mid X_y\}$ (details of the density do not have to be modelled, since they might not influence the posterior).
53.
Example: Multivariate Normal Distribution
Expectation values of a normal distribution and their estimation (the class index has been omitted for legibility: $\theta_y \to \theta$):
$$\log p(x_i \mid \theta) = -\frac{1}{2}(x_i - \mu)^T \Sigma^{-1} (x_i - \mu) - \frac{d}{2} \log 2\pi - \frac{1}{2} \log |\Sigma|$$
$$\sum_{i \le n} \frac{\partial}{\partial \mu} \log p(x_i \mid \theta) = \sum_{i \le n} \Sigma^{-1}(x_i - \mu) = 0 \;\Rightarrow\; \hat{\mu}_n = \frac{1}{n} \sum_{i \le n} x_i \quad \text{(estimator for } \mu)$$
The average value formula results from differentiating the quadratic form.
Unbiasedness: $\mathbb{E}[\hat{\mu}_n] = \frac{1}{n} \sum_{i \le n} \mathbb{E}[x_i] = \mathbb{E}[x] = \mu$
54.
ML estimation of the variance (1d case)
$$\sum_{i \le n} \frac{\partial}{\partial \sigma^2} \log p(x_i \mid \theta) = \frac{\partial}{\partial \sigma^2} \Big[ -\frac{1}{2\sigma^2} \sum_{i \le n} |x_i - \mu|^2 - \frac{n}{2} \log(2\pi\sigma^2) \Big] = \frac{1}{2} \sigma^{-4} \sum_{i \le n} |x_i - \mu|^2 - \frac{n}{2} \sigma^{-2} = 0$$
$$\Rightarrow\; \hat{\sigma}_n^2 = \frac{1}{n} \sum_{i \le n} |x_i - \mu|^2$$
Multivariate case: $\hat{\Sigma}_n = \frac{1}{n} \sum_{i \le n} (x_i - \mu)(x_i - \mu)^T$
$\hat{\Sigma}_n$ is biased, i.e., $\mathbb{E}[\hat{\Sigma}_n] \neq \Sigma$, if $\mu$ is unknown.
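An empirical check of this bias statement: with µ replaced by the estimated mean, the ML variance estimator underestimates σ² by the factor (n − 1)/n, while dividing by n − 1 removes the bias. The sample size and number of Monte Carlo repetitions below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials, sigma2 = 5, 100_000, 1.0

samples = rng.normal(0.0, np.sqrt(sigma2), (trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)            # estimated mean per trial
ml_var = np.mean((samples - mu_hat) ** 2, axis=1)       # divide by n (ML, biased)
unbiased = np.sum((samples - mu_hat) ** 2, axis=1) / (n - 1)

print(ml_var.mean())     # ~ (n-1)/n * sigma2 = 0.8
print(unbiased.mean())   # ~ sigma2 = 1.0
```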