1.
Machine Learning
Central Problem of Pattern Recognition:
Supervised and Unsupervised Learning
Classification
Bayesian Decision Theory
Perceptrons and SVMs
Clustering
2.
Machine Learning – What is the Challenge?
Find optimal structure in data and validate it!
Concept for Robust Data Analysis
[Flow diagram: Data (vectors, relations, images, ...) → Structure definition (costs, risk, ...) → Structure optimization (stochastic approximation) → Validation (multiscale analysis, statistical learning theory); regularization via quantization of the solution space (information / rate distortion theory); feedback of statistical & computational complexity.]
3.
The Problem of Pattern Recognition
Machine Learning (as statistics) addresses a number of challenging inference problems in pattern recognition which span the range from statistical modeling to efficient algorithmics. Approximate methods which yield good performance on average are particularly important.
• Representation of objects ⇒ data representation
• What is a pattern? Definition/modeling of structure
• Optimization: search for preferred structures
• Validation: are the structures indeed in the data, or are they explained by fluctuations?
4.
Literature
• Richard O. Duda, Peter E. Hart & David G. Stork, Pattern Classification. Wiley & Sons (2001)
• Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag (2001)
• Luc Devroye, László Györfi & Gábor Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Verlag (1996)
• Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag (1983); The Nature of Statistical Learning Theory. Springer Verlag (1995)
• Larry Wasserman, All of Statistics. Springer Verlag (2004; corr. 2nd printing, ISBN 0-387-40272-1)
5.
The Classification Problem
6.
7.
Classification as a Pattern Recognition Problem
Problem: We look for a partition of the object space O (fish in the previous example) which corresponds to the classification examples.
Distinguish conceptually between “objects” o ∈ O and “data” x ∈ X!
Data: pairs of feature vectors and class labels
$Z = \{(x_i, y_i) : 1 \le i \le n,\; x_i \in \mathbb{R}^d,\; y_i \in \{1, \dots, k\}\}$
Definitions: feature space $\mathcal{X}$ with $x_i \in \mathcal{X} \subset \mathbb{R}^d$; class labels $y_i \in \{1, \dots, k\}$
Classifier: mapping $c : \mathcal{X} \to \{1, \dots, k\}$
k-class problem: what is $y_{n+1} \in \{1, \dots, k\}$ for $x_{n+1} \in \mathbb{R}^d$?
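The definitions above fit in a few lines of code. A minimal sketch (not from the slides; the nearest-class-mean rule and all names are our own choices) of labeled data Z and a classifier c : X → {1, . . . , k}:

```python
# Illustrative sketch: training data Z = {(x_i, y_i)} and a classifier
# c: X -> {1, ..., k} realized as a nearest-class-mean rule.
import numpy as np

def fit_class_means(X, y, k):
    """Estimate one mean vector per class from labeled data X of shape (n, d)."""
    return np.stack([X[y == c].mean(axis=0) for c in range(1, k + 1)])

def classify(x, means):
    """c(x): assign x to the class whose mean is closest (labels 1..k)."""
    return int(np.argmin(np.linalg.norm(means - x, axis=1))) + 1

# Toy 2-class data in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
means = fit_class_means(X, y, k=2)
print(classify(np.array([2.5, 2.5]), means))  # -> most likely class 2
```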
8.
Example of Classification
9.
Histograms of Length Values
[Figure 1.2 (Duda, Hart & Stork): histograms of the length feature (counts vs. length) for the two categories, salmon and sea bass, with a marked threshold l*. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors.]
10.
Histograms of Skin Brightness Values
[Figure 1.3 (Duda, Hart & Stork): histograms of the lightness feature (counts vs. lightness) for salmon and sea bass, with a threshold x* as decision boundary. No single threshold value x* will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors.]
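A single feature with a threshold, as in the two histograms above, already defines a classifier. A minimal sketch, with synthetic stand-ins for the salmon/sea-bass lightness values, that picks the threshold x* with the fewest empirical errors:

```python
# Scan candidate thresholds x* on one feature and keep the one with the
# fewest training errors. The data below are made-up stand-ins.
import numpy as np

rng = np.random.default_rng(1)
salmon = rng.normal(4.0, 1.0, 200)    # hypothetical lightness values
seabass = rng.normal(6.5, 1.0, 200)

x = np.concatenate([salmon, seabass])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 0 = salmon, 1 = sea bass

def errors(threshold):
    """Count mistakes of the rule 'sea bass iff lightness > threshold'."""
    return np.sum((x > threshold) != y)

candidates = np.linspace(x.min(), x.max(), 500)
x_star = candidates[np.argmin([errors(t) for t in candidates])]
print(f"x* = {x_star:.2f}, training errors = {errors(x_star)}")
```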
11.
Linear Classification
[Figure 1.4 (Duda, Hart & Stork): scatter plot of the two features lightness and width for sea bass and salmon. A straight line could serve as a decision boundary of our classifier, with some overall classification error remaining.]
12.
Overfitting
[Figure 1.5 (Duda, Hart & Stork): overly complex models for the fish lead to complicated decision boundaries. While such a decision boundary may classify the training samples perfectly, it would lead to poor performance on future patterns, such as the novel test point marked “?”.]
13.
Optimized Nonlinear Classification
[Figure 1.6 (Duda, Hart & Stork): Occam's razor argument: Entia non sunt multiplicanda praeter necessitatem! The decision boundary shown might represent the optimal trade-off between performance on the training set and simplicity of the classifier, thereby giving the highest accuracy on new patterns.]
14.
Regression
(see Introduction to Machine Learning)
Question: Given a feature (vector) $x_i$ and a corresponding noisy measurement of a function value $y_i = f(x_i) + \text{noise}$, what is the unknown function f(·) in a hypothesis class H?
Data: $Z = \{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R} : 1 \le i \le n\}$
Modeling choice: What is an adequate hypothesis class and a good noise model? Fitting with linear/nonlinear functions?
15.
The Regression Function
Questions: (i) What is the statistically optimal estimate of a function $f : \mathbb{R}^d \to \mathbb{R}$, and (ii) which algorithm achieves this goal most efficiently?
Solution to (i): the regression function
$$y(x) = \mathbb{E}\{y \mid X = x\} = \int_\Omega y\, p(y \mid X = x)\, dy$$
[Figure: nonlinear regression of a sinc function, sinc(x) := sin(x)/x (gray), with a regression fit (black) based on 50 noisy data points.]
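The regression function can be estimated directly from samples. The slide does not say which fitting method produced the black curve, so the following sketch uses a Nadaraya-Watson kernel smoother (the Gaussian kernel and bandwidth h = 1 are our assumptions) on 50 noisy sinc samples:

```python
# Kernel estimate of the regression function y(x) = E[y | X = x].
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(-10, 10, 50)
# np.sinc(t) = sin(pi t)/(pi t), so sin(x)/x = np.sinc(x / pi)
y_train = np.sinc(x_train / np.pi) + rng.normal(0, 0.1, 50)

def nw_estimate(x, h=1.0):
    """Kernel-weighted average of the y_i: an estimate of E[y | X = x]."""
    w = np.exp(-0.5 * ((x - x_train) / h) ** 2)
    return np.sum(w * y_train) / np.sum(w)

for x0 in (0.0, 2.0, 5.0):
    print(f"f_hat({x0}) = {nw_estimate(x0):+.3f}, sinc({x0}) = {np.sinc(x0/np.pi):+.3f}")
```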
16.
Examples of linear and nonlinear regression
[Figure, two panels: linear regression (left) and nonlinear regression (right).]
How should we measure the deviations?
[Figure, two panels: vertical offsets (left) vs. perpendicular offsets (right).]
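A sketch contrasting the two deviation measures on synthetic data: ordinary least squares minimizes vertical offsets, while a total-least-squares fit via the SVD minimizes perpendicular offsets.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 100) + rng.normal(0, 0.5, 100)   # noise in x too
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)

# Vertical offsets: ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])
slope_ols, intercept_ols = np.linalg.lstsq(A, y, rcond=None)[0]

# Perpendicular offsets: total least squares = direction of largest
# variance of the centered points (first right-singular vector).
pts = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(pts, full_matrices=False)
dx, dy = vt[0]
slope_tls = dy / dx
intercept_tls = y.mean() - slope_tls * x.mean()

print(f"OLS: y = {slope_ols:.3f} x + {intercept_ols:.3f}")
print(f"TLS: y = {slope_tls:.3f} x + {intercept_tls:.3f}")
```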
17.
Core Questions of Pattern Recognition:
Unsupervised Learning
No teacher signal is available for the learning algorithm; learning is guided by a general cost/risk function.
Examples of unsupervised learning:
1. Data clustering, vector quantization: as in classification we search for a partitioning of the objects into groups, but explicit labelings are not available.
2. Hierarchical data analysis: search for tree structures in data.
3. Visualisation, dimension reduction.
Semi-supervised learning: some of the data are labeled, most of them are unlabeled.
18.
Modes of Learning
Reinforcement Learning: weakly supervised learning; action chains are evaluated only at the end.
Example: Backgammon; the neural network TD-Gammon gained the world championship! Quite popular in robotics.
Active Learning: data are selected according to their expected information gain.
Example: information filtering.
Inductive Learning: the learning algorithm extracts logical rules from the data.
Inductive Logic Programming is a popular subarea of Artificial Intelligence.
19.
Vectorial Data
[Figure: scatter plot of data from 20 Gaussian sources in $\mathbb{R}^{20}$ (classes labeled A–T), projected onto two dimensions with Principal Component Analysis.]
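A sketch of how such a projection is computed (the slide's actual data are unavailable, so the 20 Gaussian sources are simulated): center the data and project onto the two leading right-singular vectors.

```python
# PCA projection of high-dimensional Gaussian sources onto 2 dimensions.
import numpy as np

rng = np.random.default_rng(4)
k, d, n_per = 20, 20, 15
means = rng.normal(0, 3, (k, d))                 # one Gaussian source per class
X = np.vstack([rng.normal(m, 1.0, (n_per, d)) for m in means])

Xc = X - X.mean(axis=0)                          # center the data
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ vt[:2].T                             # coordinates in the top-2 PC plane
print(proj.shape)                                # (300, 2): points ready to plot
```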
20.
Relational Data
[Figure: matrix of pairwise dissimilarities of 145 globins selected from 4 classes: α-globins, β-globins, myoglobins, and globins of insects and plants.]
21.
Scales for Data
Nominal or categorical scale: qualitative, without quantitative measurements; e.g. a binary scale F = {0, 1} (presence or absence of properties like “kosher”), or the taste categories “sweet, sour, salty and bitter”.
Ordinal scale: measurement values are meaningful only with respect to other measurements, i.e., the rank order of the measurements carries the information, not the numerical differences (e.g. information on the ranking of different marathon races!?).
22.
Quantitative scale:
• Interval scale: the relation of numerical differences carries the information; invariance w.r.t. translation and scaling (Fahrenheit scale of temperature).
• Ratio scale: the zero value of the scale carries information, but not the measurement unit (Kelvin scale).
• Absolute scale: absolute values are meaningful (grades of final exams).
23.
Machine Learning: Topic Chart
• Core problems of pattern recognition
• Bayesian decision theory
• Perceptrons and Support vector machines
• Data clustering
24.
Bayesian Decision Theory
The Problem of Statistical Decisions
Task: n objects have to be partitioned into the classes 1, . . . , k, the doubt class D and the outlier class O.
D: doubt class (→ new measurements required)
O: outlier class, definitely none of the classes 1, 2, . . . , k
Objects are characterized by feature vectors X ∈ X, X ∼ P(X), with the probability P(X = x) of feature values x.
Statistical modeling: objects represented by data X and classes Y are considered to be random variables, i.e., (X, Y) ∼ P(X, Y).
Conceptually, it is not mandatory to consider class labels as random, since they might be induced by legal considerations or conventions.
25.
Structure of the feature space X
• X ⊂ R^d
• X = X₁ × X₂ × · · · × X_d with X_i ⊆ R or X_i finite.
Remark: in most situations we can define the feature space as a subset of R^d or as tuples of real, categorical (B = {0, 1}) or ordinal numbers. Sometimes we have more complicated data spaces composed of lists, trees or graphs.
Class density / likelihood: $p_y(x) := P(X = x \mid Y = y)$ is the probability of a feature value x given a class y.
Parametric statistics: estimate the parameters of the class densities $p_y(x)$.
Non-parametric statistics: minimize the empirical risk.
26.
Motivation of Classification
Given are labeled data $Z = \{(x_i, y_i) : i \le n\}$.
Questions:
1. What are the class boundaries?
2. What are the class-specific densities $p_y(x)$?
3. How many modes or parameters do we need to model $p_y(x)$?
4. ...
[Figure: quadratic SVM classifier for five classes; white areas are ambiguous regions.]
27.
Thomas Bayes and his Terminology
The State of Nature is modelled as a random variable!
prior: P{model}
likelihood: P{data | model}
posterior: P{model | data}
evidence: P{data}
Bayes Rule: $P\{\text{model} \mid \text{data}\} = \dfrac{P\{\text{data} \mid \text{model}\}\, P\{\text{model}\}}{P\{\text{data}\}}$
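A numeric illustration of Bayes' rule with made-up numbers, the evidence P{data} appearing as the normalizer:

```python
# Prior and likelihood are assumed example values; the posterior follows
# from Bayes' rule with the evidence as normalizing constant.
prior = {"model_A": 0.7, "model_B": 0.3}
likelihood = {"model_A": 0.2, "model_B": 0.6}   # P(data | model), assumed

evidence = sum(prior[m] * likelihood[m] for m in prior)            # P(data)
posterior = {m: prior[m] * likelihood[m] / evidence for m in prior}
print(posterior)   # {'model_A': 0.4375, 'model_B': 0.5625}
```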
28.
Ronald A. Fisher and Frequentism
Fisher, Ronald Aylmer (1890–1962): founder of frequentist statistics, together with Jerzy Neyman & Karl Pearson. British mathematician and biologist who invented revolutionary techniques for applying statistics to the natural sciences.
• Maximum likelihood method
• Fisher information: a measure of the information content of densities
• Sampling theory
• Hypothesis testing
29.
Bayesianism vs. Frequentist Inference¹
Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of statements, or to the degree of belief of rational agents in the truth of statements; together with Bayes' theorem, it becomes Bayesian inference. The Bayesian interpretation of probability allows probabilities to be assigned to random events, but also allows the assignment of probabilities to any other kind of statement. Bayesians assign probabilities to any statement, even when no random process is involved, as a way to represent its plausibility. As such, the scope of Bayesian inquiries includes the scope of frequentist inquiries.
The limiting relative frequency of an event over a long series of trials is the conceptual foundation of the frequency interpretation of probability. Frequentism rejects degree-of-belief interpretations of mathematical probability, as in Bayesianism, and assigns probabilities only to random events, according to their relative frequencies of occurrence.
¹ see http://encyclopedia.thefreedictionary.com/
30.
Bayes Rule for Known Densities and Parameters
Assume that we know how the features are distributed for the different classes, i.e., the class-conditional densities and their parameters are known. What is the best classification strategy in this situation?
Classifier: $\hat{c} : \mathcal{X} \to \{1, \dots, k, D\}$
The assignment function $\hat{c}$ maps the feature space $\mathcal{X}$ to the set of classes {1, . . . , k, D}. (Outliers are neglected.)
Quality of a classifier: whenever a classifier returns a label which differs from the correct class Y = y, it has made a mistake.
31.
Error count: the sum of indicator functions $\sum_{x \in \mathcal{X}} I_{\{\hat{c}(x) \neq y\}}$ counts the classifier mistakes. Note that this error count is a random variable!
Expected errors, also called the expected risk, define the quality of a classifier:
$$R(\hat{c}) = \sum_{y \le k} P(y)\, \mathbb{E}_{P(x)}\big[ I_{\{\hat{c}(x) \neq y\}} \mid Y = y \big] + \text{terms from } D$$
Remark: the rationale behind this choice comes from gambling. If we bet on a particular outcome of our experiment, and our gain is measured by how often we assign the measurements to the correct class, then a classifier with minimal expected risk will win on average against any other classification rule (“Dutch books”)!
32.
The Loss Function
Weighted mistakes are introduced when classification errors are not equally costly; e.g. in medical diagnosis, some disease classes might be harmless and others might be lethal despite similar symptoms.
⇒ We introduce a loss function L(y, z) which denotes the loss for the decision z if class y is correct.
0-1 loss: all classes are treated the same!
$$L_{0-1}(y, z) = \begin{cases} 0 & \text{if } z = y \text{ (correct decision)} \\ 1 & \text{if } z \neq y \text{ and } z \neq D \text{ (wrong decision)} \\ d & \text{if } z = D \text{ (no decision)} \end{cases}$$
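The 0-1 loss with doubt option transcribes directly into code; the doubt cost d = 0.3 below is an arbitrary example value.

```python
# 0-1 loss with doubt: the doubt class is encoded as the string "D".
def loss_01(y, z, d=0.3):
    """L(y, z): 0 for a correct decision, 1 for a wrong one, d for doubt."""
    if z == "D":
        return d          # no decision
    return 0.0 if z == y else 1.0

print(loss_01(1, 1), loss_01(1, 2), loss_01(1, "D"))  # 0.0 1.0 0.3
```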
33.
• Weighted classification costs L(y, z) ∈ R₊ are frequently used, e.g. in medicine; classification costs can also be asymmetric, that means L(y, z) ≠ L(z, y) (e.g. (z, y) ∼ (pancreas cancer, gastritis)).
Conditional risk of the classifier: the expected loss given class y,
$$R(\hat{c}, y) = \mathbb{E}_x[L(y, \hat{c}(x)) \mid Y = y] = \sum_{z \le k} L(y, z)\, P\{\hat{c}(x) = z \mid Y = y\} + L(y, D)\, P\{\hat{c}(x) = D \mid Y = y\}$$
For the 0-1 loss this reduces to
$$R(\hat{c}, y) = \underbrace{P\{\hat{c}(x) \neq y \wedge \hat{c}(x) \neq D \mid Y = y\}}_{p_{mc}(y):\ \text{probability of misclassification}} + d \cdot \underbrace{P\{\hat{c}(x) = D \mid Y = y\}}_{p_d(y):\ \text{probability of doubt}}$$
34.
Total risk of the classifier (with $\pi_y := P(Y = y)$):
$$R(\hat{c}) = \sum_{z \le k} \pi_z\, p_{mc}(z) + d \sum_{z \le k} \pi_z\, p_d(z) = \mathbb{E}_C\, R(\hat{c}, C)$$
Asymptotic average loss:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{j \le n} L(c_j, \hat{c}(x_j)) = \lim_{n \to \infty} \hat{R}(\hat{c}) = R(\hat{c}),$$
where $\{(x_j, c_j) : 1 \le j \le n\}$ is a random sample set of size n.
This formula can be interpreted as the expected loss with the empirical distribution as probability model.
35.
Posterior class probability
Posterior: let
$$p(y \mid x) \equiv P\{Y = y \mid X = x\} = \frac{\pi_y\, p_y(x)}{\sum_z \pi_z\, p_z(x)}$$
be the posterior of the class y given X = x.
(The “partition of one” $\pi_y p_y(x) / \sum_z \pi_z p_z(x)$ results from the normalization $\sum_z p(z \mid x) = 1$.)
Likelihood: the class-conditional density $p_y(x)$ is the probability of observing data X = x given class Y = y.
Prior: $\pi_y$ is the probability of class Y = y.
36.
Bayes Optimal Classifier
Theorem 1. The classification rule which minimizes the total risk for the 0-1 loss is
$$\hat{c}(x) = \begin{cases} y & \text{if } p(y \mid x) = \max_{z \le k} p(z \mid x) > 1 - d, \\ D & \text{if } p(y \mid x) \le 1 - d \;\; \forall y. \end{cases}$$
Generalization to arbitrary loss functions:
$$\hat{c}(x) = \begin{cases} y & \text{if } \sum_z L(z, y)\, p(z \mid x) = \min_{\rho \le k} \sum_z L(z, \rho)\, p(z \mid x) \le d, \\ D & \text{else.} \end{cases}$$
Bayes classifier: select the class with the highest $\pi_y p_y(x)$ value if it exceeds the cost for not making a decision, i.e., $\pi_y p_y(x) > (1 - d)\, p(x)$.
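Theorem 1 can be sketched in a few lines: compute the posteriors p(y | x) and return the maximizing class only when its posterior exceeds 1 − d, otherwise return the doubt class D. The two Gaussian class densities and d = 0.2 below are illustrative assumptions.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([0.5, 0.5])
params = [(0.0, 1.0), (3.0, 1.0)]        # (mu, sigma) per class, assumed

def bayes_classify(x, d=0.2):
    lik = np.array([gauss_pdf(x, m, s) for m, s in params])
    posterior = priors * lik / np.sum(priors * lik)   # p(y | x)
    best = int(np.argmax(posterior))
    return best + 1 if posterior[best] > 1 - d else "D"  # classes 1..k or doubt

for x in (-1.0, 1.5, 4.0):
    print(x, bayes_classify(x))          # class 1, doubt D, class 2
```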
37.
Proof: Calculate the total expected loss $R(\hat{c})$:
$$R(\hat{c}) = \mathbb{E}_X\big[\mathbb{E}_Y[L_{0-1}(Y, \hat{c}(x)) \mid X = x]\big] = \int_{\mathcal{X}} \mathbb{E}_Y[L_{0-1}(Y, \hat{c}(x)) \mid X = x]\; p(x)\, dx \quad \text{with } p(x) = \sum_{z \le k} \pi_z p_z(x)$$
Minimize the conditional expectation value, since it depends only on $\hat{c}$:
$$\begin{aligned} \hat{c}(x) &= \operatorname*{argmin}_{\tilde{c} \in \{1, \dots, k, D\}} \mathbb{E}\big[L_{0-1}(Y, \tilde{c}) \mid X = x\big] = \operatorname*{argmin}_{\tilde{c} \in \{1, \dots, k, D\}} \sum_{z \le k} L_{0-1}(z, \tilde{c})\, p(z \mid x) \\ &= \begin{cases} \operatorname*{argmin}_{\tilde{c} \in \{1, \dots, k\}} \big(1 - p(\tilde{c} \mid x)\big) & \text{if } d > \min_c \big(1 - p(c \mid x)\big) \\ D & \text{else} \end{cases} \\ &= \begin{cases} \operatorname*{argmax}_{\tilde{c} \in \{1, \dots, k\}} p(\tilde{c} \mid x) & \text{if } 1 - d < \max_c p(c \mid x) \\ D & \text{else} \end{cases} \end{aligned}$$
38.
Outliers
• Modeling by an outlier class $\pi_O$ with density $p_O(x)$.
• “Novelty detection”: classify a measurement as an outlier if
$$\pi_O\, p_O(x) \ge \max\Big\{ (1 - d)\, p(x),\ \max_z \pi_z\, p_z(x) \Big\}$$
• The outlier concept causes conceptual problems and does not fit into statistical decision theory, since outliers indicate an erroneous or incomplete specification of the statistical model!
• The outlier class is often modeled by a uniform distribution.
Attention: a normalized uniform distribution does not exist on many feature spaces!
⇒ Limit the support of the measurement space or put a (Gaussian) measure on it!
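The novelty-detection rule above can be sketched the same way; the uniform outlier density on a bounded support follows the last bullet, and all constants below are assumptions.

```python
# Flag x as an outlier when pi_O p_O(x) dominates both the doubt term
# (1 - d) p(x) and every class term pi_z p_z(x).
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([0.45, 0.45])
params = [(0.0, 1.0), (3.0, 1.0)]
pi_out, p_out = 0.10, 1.0 / 20.0         # uniform outlier density on [-10, 10]

def is_outlier(x, d=0.2):
    class_terms = priors * np.array([gauss_pdf(x, m, s) for m, s in params])
    p_x = class_terms.sum() + pi_out * p_out          # total evidence p(x)
    return pi_out * p_out >= max((1 - d) * p_x, class_terms.max())

print(is_outlier(1.5), is_outlier(8.0))  # False (in-distribution), True (far out)
```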
39.
Class-Conditional Densities and Posteriors for 2 Classes
[Left, Figure 2.1 (Duda, Hart & Stork): hypothetical class-conditional probability density functions $p(x \mid \omega_i)$, showing the probability density of measuring a particular feature value x given that the pattern is in category $\omega_i$. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. The density functions are normalized, so the area under each curve is 1.0.]
[Right, Figure 2.2 (Duda, Hart & Stork): posterior probabilities $P(\omega_i \mid x)$ for the priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$ and the class-conditional densities of Fig. 2.1. Given that a pattern is measured to have feature value x = 14, the probability that it is in $\omega_2$ is roughly 0.08, and that it is in $\omega_1$ is 0.92. At every x the posteriors sum to 1.0.]
40.
Likelihood Ratio for the 2-Class Example
[Figure 2.3 (Duda, Hart & Stork): the likelihood ratio $p(x \mid \omega_1)/p(x \mid \omega_2)$ for the distributions shown in Fig. 2.1, with thresholds $\theta_a$, $\theta_b$ and the induced decision regions $R_1$, $R_2$. With a zero-one classification loss, the decision boundaries are determined by a threshold on this ratio.]
41.
Discriminant Functions
[Figure 2.5 (Duda, Hart & Stork): the functional structure of a general statistical pattern classifier with d inputs $x_1, \dots, x_d$ and c discriminant functions $g_1(x), \dots, g_c(x)$ feeding a subsequent action step (e.g., classification with costs). A subsequent step determines which of the discriminant values is the maximum and categorizes the input pattern accordingly; the arrows show the direction of the flow of information.]
• Discriminant function: $g_y(x) = P\{Y = y \mid X = x\}$
• Class decision: $g_y(x) > g_z(x)\ \forall z \neq y \Rightarrow$ class y.
• Different discriminant functions can yield the same decision, e.g. $\tilde{g}_y(x) = \log P\{x \mid y\} + \log \pi_y$; this can minimize implementation problems!
42.
Example for Discriminant Functions
[Figure 2.6 (Duda, Hart & Stork): a two-dimensional two-category classifier with Gaussian densities $p(x \mid \omega_i) P(\omega_i)$; the decision boundary consists of two hyperbolas, so the decision region $R_2$ is not simply connected.]
43.
Adaptation of Discriminant Functions
[Figure: the classifier network of Fig. 2.5 (inputs $x_1, \dots, x_d$, discriminant functions $g_1(x), \dots, g_c(x)$, MAX unit, action) extended by a teacher signal.]
The red connections (weights) are adapted in such a way that the teacher signal is imitated by the discriminant function.
44.
Example Discriminant Functions: Normal Distributions
The likelihood of class y is Gaussian distributed:
$$p_y(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_y|}} \exp\Big( -\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) \Big)$$
Special case: $\Sigma_y = \sigma^2 I$
$$g_y(x) = \log p_y(x) + \log \pi_y = -\frac{1}{2\sigma^2} \|x - \mu_y\|^2 + \log \pi_y + \text{const.}$$
45.
⇒ Decision surface between classes z and y:
$$-\frac{1}{2\sigma^2}\|x - \mu_z\|^2 + \log \pi_z = -\frac{1}{2\sigma^2}\|x - \mu_y\|^2 + \log \pi_y$$
$$-\|x\|^2 + 2x \cdot \mu_z - \|\mu_z\|^2 + 2\sigma^2 \log \pi_z = -\|x\|^2 + 2x \cdot \mu_y - \|\mu_y\|^2 + 2\sigma^2 \log \pi_y$$
$$\Rightarrow\; 2x \cdot (\mu_z - \mu_y) - \|\mu_z\|^2 + \|\mu_y\|^2 + 2\sigma^2 \log \frac{\pi_z}{\pi_y} = 0$$
Linear decision rule: $w^T(x - x_0) = 0$ with
$$w = \mu_z - \mu_y, \qquad x_0 = \frac{1}{2}(\mu_z + \mu_y) - \frac{\sigma^2 (\mu_z - \mu_y)}{\|\mu_z - \mu_y\|^2} \log \frac{\pi_z}{\pi_y}$$
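A sketch that evaluates this linear decision rule for two spherical Gaussian classes; the means, variance and priors below are example values.

```python
# Linear decision rule w^T (x - x_0) = 0 for Sigma = sigma^2 I.
import numpy as np

mu_z = np.array([0.0, 0.0])
mu_y = np.array([3.0, 1.0])
sigma2 = 1.0
pi_z, pi_y = 0.7, 0.3

w = mu_z - mu_y
x0 = 0.5 * (mu_z + mu_y) - sigma2 * w / np.dot(w, w) * np.log(pi_z / pi_y)

def decide(x):
    """Positive side of the hyperplane -> class z, negative side -> class y."""
    return "z" if np.dot(w, x - x0) > 0 else "y"

print(x0, decide(np.array([1.0, 0.0])), decide(np.array([2.5, 1.0])))  # z, y
```

Note how the unequal priors shift $x_0$ away from the midpoint of the means, toward the less probable class.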
46.
Decision Surfaces for Gaussians in 1, 2, 3 Dimensions
[Figure 2.10 (Duda, Hart & Stork): if the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. The one-, two-, and three-dimensional examples indicate $p(x \mid \omega_i)$ and the boundaries for the case $P(\omega_1) = P(\omega_2) = 0.5$; in the three-dimensional case, the grid plane separates $R_1$ from $R_2$.]
47.
[Figure 2.11 (Duda, Hart & Stork): as the priors are changed, the decision boundary shifts; for sufficiently disparate priors (e.g. $P(\omega_1) = 0.7$, 0.8, 0.9 or 0.99) the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions.]
48.
Multi-Class Case
Decision regions for four Gaussian distributions: even for such a small number of classes, the discriminant functions show a complex form.
[Figure 2.16 (Duda, Hart & Stork): the decision regions $R_1, \dots, R_4$ for four normal distributions; even with such a small number of categories, the shapes of the boundary regions can be rather complex.]
49.
Example: Gene Expression Data
The expression of genes is measured for various patients. The expression profiles provide information about the metabolic state of the cells, meaning that they can be used as indicators for disease classes. Each patient is represented as a vector in a high-dimensional (≈ 10000) space with Gaussian class distribution.
[Figure: gene expression matrix (genes × samples) with leukemia class labels ALL B-cell, ALL T-cell and AML; predicted vs. true class assignments are indicated.]
50.
Parametric Models for Class Densities
If we knew the prior probabilities and the class-conditional probabilities, then we could calculate the optimal classifier. But we don't!
Task: estimate $p(y \mid x; \theta)$ from samples $Z = \{(x_1, y_1), \dots, (x_n, y_n)\}$ for classification.
Data are sorted according to their classes:
$$X_y = \{X_{1y}, \dots, X_{n_y, y}\} \quad \text{where } X_{iy} \sim P\{X \mid Y = y; \theta_y\}$$
Question: how can we use the information in the samples to estimate $\theta_y$?
Assumption: classes can be separated and treated independently! $X_y$ is not informative w.r.t. $\theta_z$, $z \neq y$.
51.
Maximum Likelihood Estimation Theory
Likelihood of the data set: $P\{X_y \mid \theta_y\} = \prod_{i \le n_y} p(x_{iy} \mid \theta_y)$
Estimation principle: select the parameters $\hat{\theta}_y$ which maximize the likelihood, that means
$$\hat{\theta}_y = \arg\max_{\theta_y} P\{X_y \mid \theta_y\}$$
Procedure: find the extreme value of the log-likelihood function:
$$\nabla_{\theta_y} \log P\{X \mid \theta_y\} = 0 \quad\Longleftrightarrow\quad \sum_{i \le n} \frac{\partial}{\partial \theta_y} \log p(x_i \mid \theta_y) = 0$$
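A sketch of this procedure for a one-dimensional Gaussian class: minimize the negative log-likelihood numerically (here with scipy.optimize.minimize, our choice of optimizer) and compare with the closed-form estimators derived on the following slides.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.5, 500)            # samples of one class

def neg_log_lik(theta):
    mu, log_sigma = theta                # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    # negative Gaussian log-likelihood up to an additive constant
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                 # numerical ML estimates
print(x.mean(), x.std())                 # closed-form ML estimates agree
```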
52.
Remark
Bias of an estimator: $\mathrm{bias}(\hat{\theta}_n) = \mathbb{E}\{\hat{\theta}_n\} - \theta$.
Consistent estimator: a point estimator $\hat{\theta}_n$ of a parameter $\theta$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$.
Asymptotic normality of maximum likelihood estimates: $(\hat{\theta}_n - \theta)/\sqrt{\mathbb{V}\{\hat{\theta}_n\}} \rightsquigarrow \mathcal{N}(0, 1)$.
Alternative to ML class density estimation: discriminative learning by maximizing the a posteriori distribution $P\{\theta_y \mid X_y\}$ (details of the density do not have to be modelled, since they might not influence the posterior).
53.
Example: Multivariate Normal Distribution
Expectation values of a normal distribution and their estimation (the class index has been omitted for legibility: $\theta_y \to \theta$):
$$\log p(x_i \mid \theta) = -\frac{1}{2}(x_i - \mu)^T \Sigma^{-1} (x_i - \mu) - \frac{d}{2} \log 2\pi - \frac{1}{2} \log |\Sigma|$$
$$\sum_{i \le n} \frac{\partial}{\partial \mu} \log p(x_i \mid \theta) = \sum_{i \le n} \Sigma^{-1}(x_i - \mu) = 0 \;\Rightarrow\; \hat{\mu}_n = \frac{1}{n} \sum_{i \le n} x_i \quad \text{(estimator for } \mu)$$
The average value formula results from differentiating the quadratic form.
Unbiasedness: $\mathbb{E}[\hat{\mu}_n] = \frac{1}{n} \sum_{i \le n} \mathbb{E}[x_i] = \mathbb{E}[x] = \mu$
54.
ML estimation of the variance (1d case)
$$\sum_{i \le n} \frac{\partial}{\partial \sigma^2} \log p(x_i \mid \theta) = \frac{\partial}{\partial \sigma^2} \Big[ -\frac{1}{2\sigma^2} \sum_{i \le n} |x_i - \mu|^2 - \frac{n}{2} \log(2\pi\sigma^2) \Big] = \frac{1}{2} \sigma^{-4} \sum_{i \le n} |x_i - \mu|^2 - \frac{n}{2} \sigma^{-2} = 0$$
$$\Rightarrow\; \hat{\sigma}_n^2 = \frac{1}{n} \sum_{i \le n} |x_i - \mu|^2$$
Multivariate case: $\hat{\Sigma}_n = \frac{1}{n} \sum_{i \le n} (x_i - \mu)(x_i - \mu)^T$
$\hat{\Sigma}_n$ is biased, i.e., $\mathbb{E}[\hat{\Sigma}_n] \neq \Sigma$, if $\mu$ is unknown.
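An empirical check of this bias statement: with µ replaced by the estimated mean, the ML variance estimator underestimates σ² by the factor (n − 1)/n, while dividing by n − 1 removes the bias. The sample size and number of Monte Carlo repetitions below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials, sigma2 = 5, 100_000, 1.0

samples = rng.normal(0.0, np.sqrt(sigma2), (trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)            # estimated mean per trial
ml_var = np.mean((samples - mu_hat) ** 2, axis=1)       # divide by n (ML, biased)
unbiased = np.sum((samples - mu_hat) ** 2, axis=1) / (n - 1)

print(ml_var.mean())     # ~ (n-1)/n * sigma2 = 0.8
print(unbiased.mean())   # ~ sigma2 = 1.0
```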