Machine Learning Presentation Transcript (Visual Computing lecture, Joachim M. Buhmann)

• Machine Learning. Central Problem of Pattern Recognition: Supervised and Unsupervised Learning; Classification; Bayesian Decision Theory; Perceptrons and SVMs; Clustering.
• Machine Learning – What is the Challenge? Find optimal structure in data and validate it! Concept for robust data analysis (flow diagram): Data (vectors, relations, images, ...) → Structure definition (costs, risk, ...) → Structure optimization (multiscale analysis, stochastic approximation) → Structure validation (statistical learning theory); quantization of the solution space, regularization, information/rate distortion theory; feedback of statistical and computational complexity. (8 March 2006, Joachim M. Buhmann, Institute for Computational Science)
• The Problem of Pattern Recognition. Machine Learning (as statistics) addresses a number of challenging inference problems in pattern recognition which span the range from statistical modeling to efficient algorithmics. Approximate methods which yield good performance on average are particularly important. • Representation of objects ⇒ data representation. • What is a pattern? Definition/modeling of structure. • Optimization: search for preferred structures. • Validation: are the structures indeed in the data, or are they explained by fluctuations?
• Literature: • Richard O. Duda, Peter E. Hart & David G. Stork, Pattern Classification. Wiley & Sons (2001). • Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag (2001). • Luc Devroye, László Györfi & Gábor Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Verlag (1996). • Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag (1983); The Nature of Statistical Learning Theory. Springer Verlag (1995). • Larry Wasserman, All of Statistics. Springer Verlag (2004; corr. 2nd printing, ISBN 0-387-40272-1).
• The Classification Problem. [Figure slide.]
• Classification as a Pattern Recognition Problem. Problem: we look for a partition of the object space O (fish in the previous example) which corresponds to the classification examples. Distinguish conceptually between "objects" o ∈ O and "data" x ∈ X! Data: pairs of feature vectors and class labels, $Z = \{(x_i, y_i) : 1 \le i \le n,\ x_i \in \mathbb{R}^d,\ y_i \in \{1, \dots, k\}\}$. Definitions: feature space X with $x_i \in X \subset \mathbb{R}^d$; class labels $y_i \in \{1, \dots, k\}$. Classifier: a mapping $c : X \to \{1, \dots, k\}$. The k-class problem: what is $y_{n+1} \in \{1, \dots, k\}$ for a new $x_{n+1} \in \mathbb{R}^d$? (A minimal sketch of this setup follows below.)
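A minimal sketch of the formal setup, not a method from the lecture: the training data Z as (feature vector, label) pairs and a classifier c: X → {1, ..., k}, realized here as a simple nearest-class-mean rule with made-up data.

```python
# Sketch: data Z = {(x_i, y_i)} and a classifier c: X -> {1, ..., k}.
import numpy as np

# feature vectors in R^2 with labels in {1, 2} (illustrative values)
X_train = np.array([[1.0, 1.2], [0.8, 0.9], [3.1, 3.0], [2.9, 3.3]])
y_train = np.array([1, 1, 2, 2])

def fit_class_means(X, y):
    """Return one mean vector per class label."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def classify(x_new, class_means):
    """c(x): assign the label of the closest class mean."""
    return min(class_means, key=lambda label: np.linalg.norm(x_new - class_means[label]))

means = fit_class_means(X_train, y_train)
print(classify(np.array([3.0, 2.8]), means))   # -> 2
```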
• Example of Classification. [Figure slide.]
• Histograms of Length Values. [Figure 1.2, Duda, Hart & Stork: histograms of the length feature for the two categories, salmon and sea bass (count vs. length). No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* leads to the smallest number of errors on average.]
• Histograms of Skin Brightness Values. [Figure 1.3, Duda, Hart & Stork: histograms of the lightness feature for the two categories, salmon and sea bass (count vs. lightness). No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors.]
• Linear Classification. [Figure 1.4, Duda, Hart & Stork: the two features lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier; the overall classification error is lower than with a single feature, but some errors remain.]
• Overfitting. [Figure 1.5, Duda, Hart & Stork: overly complex models for the fish lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns (a novel test point is marked with a ?).]
• Optimized Non-Linear Classification. Occam's razor argument: Entia non sunt multiplicanda praeter necessitatem! [Figure 1.6, Duda, Hart & Stork: the decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of the classifier, thereby giving high accuracy on new patterns.]
• Regression (see Introduction to Machine Learning). Question: given a feature (vector) $x_i$ and a corresponding noisy measurement of a function value $y_i = f(x_i) + \text{noise}$, what is the unknown function f(·) in a hypothesis class H? Data: $Z = \{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R} : 1 \le i \le n\}$. Modeling choice: what is an adequate hypothesis class and a good noise model? Fitting with linear or nonlinear functions?
• The Regression Function. Questions: (i) What is the statistically optimal estimate of a function $f : \mathbb{R}^d \to \mathbb{R}$, and (ii) which algorithm achieves this goal most efficiently? Solution to (i): the regression function
$y(x) = \mathbb{E}[y \mid X = x] = \int_\Omega y\, p(y \mid X = x)\, dy.$
[Figure: nonlinear regression of a sinc function sinc(x) := sin(x)/x (gray) with a regression fit (black) based on 50 noisy data points.] (A sketch of such a fit follows below.)
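A minimal sketch, not the lecture's own method: a Nadaraya-Watson kernel regression used to approximate the regression function E[y | X = x] from noisy samples of sinc(x) = sin(x)/x. Sample size, noise level and bandwidth are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x_train = rng.uniform(-10, 10, size=n)
# np.sinc(t) = sin(pi t)/(pi t), so sinc(x) = sin(x)/x is np.sinc(x/pi)
y_train = np.sinc(x_train / np.pi) + rng.normal(scale=0.1, size=n)

def kernel_regression(x_query, x_train, y_train, bandwidth=0.8):
    """Estimate E[y | X = x] by a Gaussian-kernel weighted average of the y_i."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

x_grid = np.linspace(-10, 10, 200)
y_hat = kernel_regression(x_grid, x_train, y_train)
```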
• Examples of linear and nonlinear regression. How should we measure the deviations: vertical offsets or perpendicular offsets? (A sketch comparing the two for the linear case follows below.)
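A minimal sketch, not from the lecture, of the two deviation measures for a line fit: ordinary least squares minimizes vertical offsets, while total least squares (computed here via the first principal direction) minimizes perpendicular offsets. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

# (i) vertical offsets: ordinary least squares for y = a*x + b
A = np.column_stack([x, np.ones_like(x)])
a_ols, b_ols = np.linalg.lstsq(A, y, rcond=None)[0]

# (ii) perpendicular offsets: direction of largest variance of the centered data
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
dx, dy = Vt[0]                      # first principal direction
a_tls = dy / dx
b_tls = y.mean() - a_tls * x.mean()

print(a_ols, b_ols, a_tls, b_tls)
```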
• Core Questions of Pattern Recognition: Unsupervised Learning. No teacher signal is available for the learning algorithm; learning is guided by a general cost/risk function. Examples of unsupervised learning: 1. data clustering, vector quantization: as in classification we search for a partitioning of the objects into groups, but explicit labelings are not available; 2. hierarchical data analysis: search for tree structures in the data; 3. visualisation, dimension reduction. Semi-supervised learning: some of the data are labeled, most of them are unlabeled.
• Modes of Learning. Reinforcement learning: weakly supervised learning; action chains are evaluated only at the end (e.g. Backgammon: the neural network TD-Gammon reached world-champion level play); quite popular in robotics. Active learning: data are selected according to their expected information gain (information filtering). Inductive learning: the learning algorithm extracts logical rules from the data; Inductive Logic Programming is a popular subarea of Artificial Intelligence.
• Vectorial Data. [Figure: data from 20 Gaussian sources in R^20, projected onto two dimensions with Principal Component Analysis; each point is labeled by its source A–T.] (A PCA sketch follows below.)
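A minimal sketch with illustrative parameters: sample data from 20 Gaussian sources in R^20 and project onto the first two principal components, as in the figure.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_sources, n_per_source = 20, 20, 25
means = rng.normal(scale=3.0, size=(n_sources, d))
X = np.vstack([m + rng.normal(scale=1.0, size=(n_per_source, d)) for m in means])

Xc = X - X.mean(axis=0)               # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
projection = Xc @ Vt[:2].T            # coordinates in the first two principal components
print(projection.shape)               # (500, 2)
```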
• Relational Data. [Figure: pairwise dissimilarities of 145 globins selected from 4 classes: α-globins, β-globins, myoglobins, and globins of insects and plants.]
• Scales for Data. Nominal or categorical scale: qualitative, without quantitative measurements, e.g. a binary scale F = {0, 1} (presence or absence of properties like "kosher") or the taste categories "sweet, sour, salty, bitter". Ordinal scale: measurement values are meaningful only with respect to other measurements, i.e. the rank order of measurements carries the information, not the numerical differences (e.g. information on the ranking of different marathon races).
• Quantitative scale: • interval scale: the relation of numerical differences carries the information; invariance w.r.t. translation and scaling (Fahrenheit scale of temperature). • ratio scale: the zero value of the scale carries information, but not the measurement unit (Kelvin scale). • absolute scale: absolute values are meaningful (grades of final exams).
• Machine Learning: Topic Chart. • Core problems of pattern recognition • Bayesian decision theory • Perceptrons and support vector machines • Data clustering
• Bayesian Decision Theory: The Problem of Statistical Decisions. Task: n objects have to be partitioned into the classes 1, . . . , k, the doubt class D and the outlier class O. D: doubt class (→ new measurements required). O: outlier class, definitively none of the classes 1, 2, . . . , k. Objects are characterized by feature vectors X ∈ X, X ∼ P(X), with the probability P(X = x) of feature values x. Statistical modeling: objects represented by data X and classes Y are considered to be random variables, i.e. (X, Y) ∼ P(X, Y). Conceptually, it is not mandatory to consider class labels as random, since they might be induced by legal considerations or conventions.
• Structure of the feature space X: • X ⊂ R^d, or • X = X_1 × X_2 × · · · × X_d with X_i ⊆ R or X_i finite. Remark: in most situations we can define the feature space as a subset of R^d or as tuples of real, categorical (B = {0, 1}) or ordinal numbers. Sometimes we have more complicated data spaces composed of lists, trees or graphs. Class density / likelihood: $p_y(x) := P(X = x \mid Y = y)$ is the probability of a feature value x given a class y. Parametric statistics: estimate the parameters of the class densities $p_y(x)$. Non-parametric statistics: minimize the empirical risk.
• Motivation of Classification. Given are labeled data $Z = \{(x_i, y_i) : i \le n\}$. Questions: 1. What are the class boundaries? 2. What are the class-specific densities $p_y(x)$? 3. How many modes or parameters do we need to model $p_y(x)$? 4. ... [Figure: quadratic SVM classifier for five classes; white areas are ambiguous regions.]
• Thomas Bayes and his Terminology. The state of nature is modelled as a random variable! Prior: P{model}; likelihood: P{data|model}; posterior: P{model|data}; evidence: P{data}. Bayes rule:
$P\{\text{model} \mid \text{data}\} = \frac{P\{\text{data} \mid \text{model}\}\, P\{\text{model}\}}{P\{\text{data}\}}$
(A small numerical sketch follows below.)
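A minimal numerical sketch of Bayes' rule with made-up priors and likelihoods for two models and one observed data value.

```python
prior = {"model_1": 2/3, "model_2": 1/3}             # P{model}
likelihood = {"model_1": 0.05, "model_2": 0.20}      # P{data | model}

evidence = sum(likelihood[m] * prior[m] for m in prior)          # P{data}
posterior = {m: likelihood[m] * prior[m] / evidence for m in prior}
print(posterior)   # P{model | data}, sums to 1
```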
• Ronald A. Fisher and Frequentism. Fisher, Ronald Aylmer (1890–1962): founder of frequentist statistics, together with Jerzy Neyman & Karl Pearson. British mathematician and biologist who invented revolutionary techniques for applying statistics to the natural sciences. Contributions: the maximum likelihood method; Fisher information, a measure for the information content of densities; sampling theory; hypothesis testing.
• Bayesianism vs. Frequentist Inference (see http://encyclopedia.thefreedictionary.com/). Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of statements, or to the degree of belief of rational agents in the truth of statements; together with Bayes' theorem, it becomes Bayesian inference. The Bayesian interpretation of probability allows probabilities to be assigned to random events, but also allows the assignment of probabilities to any other kind of statement. Bayesians assign probabilities to any statement, even when no random process is involved, as a way to represent its plausibility. As such, the scope of Bayesian inquiries includes the scope of frequentist inquiries. The limiting relative frequency of an event over a long series of trials is the conceptual foundation of the frequency interpretation of probability. Frequentism rejects degree-of-belief interpretations of mathematical probability, as in Bayesianism, and assigns probabilities only to random events according to their relative frequencies of occurrence.
• Bayes Rule for Known Densities and Parameters. Assume that we know how the features are distributed for the different classes, i.e. the class-conditional densities and their parameters are known. What is the best classification strategy in this situation? Classifier: $\hat c : X \to \{1, \dots, k, D\}$. The assignment function $\hat c$ maps the feature space X to the set of classes $\{1, \dots, k, D\}$ (outliers are neglected). Quality of a classifier: whenever a classifier returns a label which differs from the correct class Y = y, it has made a mistake.
• Error count: the indicator function $I_{\{\hat c(x) \ne y\}}$, $x \in X$, counts the classifier mistakes. Note that this error count is a random variable! The expected error, also called the expected risk, defines the quality of a classifier:
$R(\hat c) = \sum_{y \le k} P(y)\, \mathbb{E}_{P(x)}\!\left[ I_{\{\hat c(x) \ne y\}} \mid Y = y \right] + \text{terms from } D$
Remark: the rationale behind this choice comes from gambling. If we bet on a particular outcome of our experiment, and our gain is measured by how often we assign the measurements to the correct class, then the classifier with minimal expected risk will win on average against any other classification rule ("Dutch books")!
• The Loss Function. Weighted mistakes are introduced when classification errors are not equally costly; e.g. in medical diagnosis, some disease classes might be harmless and others might be lethal despite similar symptoms. ⇒ We introduce a loss function L(y, z) which denotes the loss for the decision z if class y is correct. 0–1 loss: all classes are treated the same!
$L_{0\text{-}1}(y, z) = \begin{cases} 0 & \text{if } z = y \text{ (correct decision)} \\ 1 & \text{if } z \ne y \text{ and } z \ne D \text{ (wrong decision)} \\ d & \text{if } z = D \text{ (no decision)} \end{cases}$
(A small sketch of this loss follows below.)
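A minimal sketch of the 0–1 loss with a doubt option; the doubt cost d = 0.2 is an illustrative value, not fixed by the slide.

```python
DOUBT = "D"

def zero_one_loss(y, z, d=0.2):
    """L(y, z): 0 for a correct decision, 1 for a wrong one, d for abstaining."""
    if z == DOUBT:
        return d                 # no decision
    return 0.0 if z == y else 1.0

print(zero_one_loss(1, 1), zero_one_loss(1, 2), zero_one_loss(1, DOUBT))
```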
• Weighted classification costs $L(y, z) \in \mathbb{R}^+$ are frequently used, e.g. in medicine; classification costs can also be asymmetric, that means $L(y, z) \ne L(z, y)$ (e.g. (z, y) ~ (pancreas cancer, gastritis)). The conditional risk function of the classifier is the expected loss given class y:
$R(\hat c, y) = \mathbb{E}_x\!\left[ L(y, \hat c(x)) \mid Y = y \right] = \sum_{z \le k} L(y, z)\, P\{\hat c(x) = z \mid Y = y\} + L(y, D)\, P\{\hat c(x) = D \mid Y = y\}$
which for the 0–1 loss equals $P\{\hat c(x) \ne y \wedge \hat c(x) \ne D \mid Y = y\} + d \cdot P\{\hat c(x) = D \mid Y = y\}$, i.e. $p_{mc}(y)$, the probability of misclassification, plus $d$ times $p_d(y)$, the probability of doubt.
• Total risk of the classifier (with $\pi_y := P(Y = y)$):
$R(\hat c) = \sum_{z \le k} \pi_z\, p_{mc}(z) + d \sum_{z \le k} \pi_z\, p_d(z) = \mathbb{E}_C\, R(\hat c, C)$
Asymptotic average loss:
$\lim_{n \to \infty} \frac{1}{n} \sum_{j \le n} L(c_j, \hat c(x_j)) = \lim_{n \to \infty} \hat R(\hat c) = R(\hat c),$
where $\{(x_j, c_j) \mid 1 \le j \le n\}$ is a random sample set of size n. This formula can be interpreted as the expected loss with the empirical distribution as probability model.
• Posterior class probability. Posterior: let
$p(y \mid x) \equiv P\{Y = y \mid X = x\} = \frac{\pi_y\, p_y(x)}{\sum_z \pi_z\, p_z(x)}$
be the posterior of the class y given X = x. (The "partition of one" $\pi_y p_y(x) / \sum_z \pi_z p_z(x)$ results from the normalization $\sum_z p(z \mid x) = 1$.) Likelihood: the class-conditional density $p_y(x)$ is the probability of observing data X = x given class Y = y. Prior: $\pi_y$ is the probability of class Y = y.
• Bayes Optimal Classifier. Theorem 1: the classification rule which minimizes the total risk for 0–1 loss is
$c(x) = \begin{cases} y & \text{if } p(y \mid x) = \max_{z \le k} p(z \mid x) > 1 - d, \\ D & \text{if } p(y \mid x) \le 1 - d \ \forall y. \end{cases}$
Generalization to arbitrary loss functions:
$c(x) = \begin{cases} y & \text{if } \sum_z L(z, y)\, p(z \mid x) = \min_{\rho \le k} \sum_z L(z, \rho)\, p(z \mid x) \le d, \\ D & \text{else.} \end{cases}$
Bayes classifier: select the class with the highest $\pi_y p_y(x)$ value if it exceeds the cost of not making a decision, i.e. $\pi_y p_y(x) > (1 - d)\, p(x)$. (A sketch of this rule follows below.)
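A minimal sketch of the Bayes-optimal rule with a doubt option for the 0–1 loss: pick the class with the largest posterior if it exceeds 1 − d, otherwise return the doubt class D. Priors, likelihoods and d below are made up.

```python
import numpy as np

DOUBT = "D"

def bayes_classify(likelihoods, priors, d=0.2):
    """likelihoods[y] = p_y(x) for the observed x; priors[y] = pi_y."""
    joint = np.asarray(likelihoods) * np.asarray(priors)
    posterior = joint / joint.sum()               # p(y | x)
    best = int(np.argmax(posterior))
    return best if posterior[best] > 1 - d else DOUBT

# two classes, one observed feature value; the max posterior (2/3) is below 0.8,
# so the rule abstains and returns D
print(bayes_classify([0.05, 0.20], [2/3, 1/3], d=0.2))
```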
• Proof: calculate the total expected loss $R(\hat c)$:
$R(\hat c) = \mathbb{E}_X\!\left[ \mathbb{E}_Y\!\left[ L_{0\text{-}1}(Y, \hat c(x)) \mid X = x \right] \right] = \int_X \mathbb{E}_Y\!\left[ L_{0\text{-}1}(Y, \hat c(x)) \mid X = x \right] p(x)\, dx \quad \text{with } p(x) = \sum_{z \le k} \pi_z p_z(x).$
Minimize the conditional expectation, since it depends only on $\hat c$:
$\hat c(x) = \arg\min_{\tilde c \in \{1,\dots,k,D\}} \mathbb{E}\!\left[ L_{0\text{-}1}(Y, \tilde c) \mid X = x \right] = \arg\min_{\tilde c \in \{1,\dots,k,D\}} \sum_{z \le k} L_{0\text{-}1}(z, \tilde c)\, p(z \mid x)$
$= \begin{cases} \arg\min_{\tilde c \in \{1,\dots,k\}} \left( 1 - p(\tilde c \mid x) \right) & \text{if } d > \min_{\tilde c} \left( 1 - p(\tilde c \mid x) \right) \\ D & \text{else} \end{cases} \;=\; \begin{cases} \arg\max_{\tilde c \in \{1,\dots,k\}} p(\tilde c \mid x) & \text{if } 1 - d < \max_{\tilde c} p(\tilde c \mid x) \\ D & \text{else} \end{cases}$
• Outliers. • Modeling by an outlier class $\pi_O$ with $p_O(x)$. • "Novelty detection": classify a measurement as an outlier if $\pi_O\, p_O(x) \ge \max\!\left( (1 - d)\, p(x),\ \max_z \pi_z p_z(x) \right)$. • The outlier concept causes conceptual problems and does not fit into statistical decision theory, since outliers indicate an erroneous or incomplete specification of the statistical model! • The outlier class is often modeled by a uniform distribution. Attention: a normalized uniform distribution does not exist on many feature spaces! ⇒ Limit the support of the measurement space or put a (Gaussian) measure on it!
• Class-Conditional Densities and Posteriors for 2 Classes. [Figure 2.1, Duda, Hart & Stork: hypothetical class-conditional probability density functions p(x|ω_i) show the probability density of measuring a particular feature value x given that the pattern is in category ω_i; if x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish; the densities are normalized, so the area under each curve is 1.0.] [Figure 2.2, Duda, Hart & Stork: posterior probabilities P(ω_i|x) for the priors P(ω_1) = 2/3 and P(ω_2) = 1/3 and the class-conditional densities of Fig. 2.1; given that a pattern is measured to have feature value x = 14, the probability that it is in category ω_2 is roughly 0.08, and that it is in ω_1 is 0.92; at every x the posteriors sum to 1.0.]
• Likelihood Ratio for 2-Class Example. [Figure 2.3, Duda, Hart & Stork: the likelihood ratio p(x|ω_1)/p(x|ω_2) for the distributions shown in Fig. 2.1; for a zero-one or classification loss, the decision boundaries are determined by the thresholds θ_a and θ_b, yielding the decision regions R_1 and R_2.]
• Discriminant Functions. [Figure 2.5, Duda, Hart & Stork: the functional structure of a general statistical pattern classifier, with d inputs x_1, ..., x_d and c discriminant functions g_i(x); a subsequent step determines which of the discriminant values is the maximum and categorizes the input pattern accordingly.] • Discriminant function: $g_z(x) = P\{Y = z \mid X = x\}$. • Class decision: $g_y(x) > g_z(x)\ \forall z \ne y \Rightarrow$ class y. • Different discriminant functions can yield the same decision, e.g. $\tilde g_y(x) = \log P\{x \mid y\} + \log \pi_y$; this can minimize implementation problems!
• Example for Discriminant Functions. [Figure 2.6, Duda, Hart & Stork: in this two-dimensional, two-category classifier the densities p(x|ω_i)P(ω_i) are Gaussian; the decision boundary consists of two hyperbolas, so the decision region R_2 is not simply connected.]
• Adaptation of Discriminant Functions. [Figure: the classifier structure of Fig. 2.5 with a teacher signal; the red connections (weights) are adapted in such a way that the teacher signal is imitated by the discriminant functions g_1(x), ..., g_c(x), whose maximum determines the action (e.g. classification).]
• Example Discriminant Functions: Normal Distributions. The likelihood of class y is Gaussian distributed:
$p_y(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_y|}} \exp\!\left( -\tfrac12 (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) \right)$
Special case $\Sigma_y = \sigma^2 I$:
$g_y(x) = \log p_y(x) + \log \pi_y = -\frac{1}{2\sigma^2} \lVert x - \mu_y \rVert^2 + \log \pi_y + \text{const.}$
• ⇒ Decision surface between classes z and y:
$-\frac{1}{2\sigma^2}\lVert x - \mu_z\rVert^2 + \log \pi_z = -\frac{1}{2\sigma^2}\lVert x - \mu_y\rVert^2 + \log \pi_y$
$-\lVert x\rVert^2 + 2\, x \cdot \mu_z - \lVert \mu_z\rVert^2 + 2\sigma^2 \log \pi_z = -\lVert x\rVert^2 + 2\, x \cdot \mu_y - \lVert \mu_y\rVert^2 + 2\sigma^2 \log \pi_y$
$\Rightarrow\ 2\, x \cdot (\mu_z - \mu_y) - \lVert \mu_z\rVert^2 + \lVert \mu_y\rVert^2 + 2\sigma^2 \log \frac{\pi_z}{\pi_y} = 0$
Linear decision rule: $w^T (x - x_0) = 0$ with
$w = \mu_z - \mu_y, \qquad x_0 = \tfrac12 (\mu_z + \mu_y) - \frac{\sigma^2 (\mu_z - \mu_y)}{\lVert \mu_z - \mu_y\rVert^2} \log \frac{\pi_z}{\pi_y}.$
(A numerical sketch follows below.)
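A minimal numerical sketch of this linear decision rule between two spherical Gaussian classes (Σ = σ²I); the means, σ² and priors are made-up values.

```python
import numpy as np

mu_z, mu_y = np.array([2.0, 0.0]), np.array([-1.0, 1.0])
sigma2 = 1.5
pi_z, pi_y = 0.7, 0.3

w = mu_z - mu_y
x0 = 0.5 * (mu_z + mu_y) - sigma2 * w / np.dot(w, w) * np.log(pi_z / pi_y)

def decide(x):
    """Sign of w.(x - x0): positive side -> class z, negative side -> class y."""
    return "z" if np.dot(w, x - x0) > 0 else "y"

print(decide(mu_z), decide(mu_y))   # each mean lands on its own side of the hyperplane
```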
• Decision Surface for Gaussians in 1, 2, 3 Dimensions. [Figure 2.10, Duda, Hart & Stork: if the covariance matrices for two distributions are equal and proportional to the identity matrix, the distributions are spherical in d dimensions and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means; the one-, two- and three-dimensional examples show p(x|ω_i) and the boundaries for P(ω_1) = P(ω_2) = 0.5; in the three-dimensional case the grid plane separates R_1 from R_2.]
• [Figure 2.11, Duda, Hart & Stork: as the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions.]
• Multi-Class Case. [Figure 2.16, Duda, Hart & Stork: the decision regions R_1, ..., R_4 for four normal distributions; even with such a small number of classes, the discriminant functions yield boundary regions of rather complex shape.]
• Example: Gene Expression Data. The expression of genes is measured for various patients. The expression profiles provide information about the metabolic state of the cells, meaning that they could be used as indicators for disease classes. Each patient is represented as a vector in a high-dimensional (≈ 10000) space with a Gaussian class distribution. [Figure: heat map of genes vs. samples for the classes ALL B-cell, ALL T-cell and AML, with predicted and true labels.]
• Parametric Models for Class Densities. If we knew the prior probabilities and the class-conditional probabilities, then we could calculate the optimal classifier. But we don't! Task: estimate $p(y \mid x; \theta)$ from samples $Z = \{(x_1, y_1), \dots, (x_n, y_n)\}$ for classification. The data are sorted according to their classes: $X^y = \{X_{1y}, \dots, X_{n_y,y}\}$ where $X_{iy} \sim P\{X \mid Y = y; \theta_y\}$. Question: how can we use the information in the samples to estimate $\theta_y$? Assumption: classes can be separated and treated independently! $X^y$ is not informative w.r.t. $\theta_z$, $z \ne y$.
• Maximum Likelihood Estimation Theory. Likelihood of the data set: $P\{X^y \mid \theta_y\} = \prod_{i \le n_y} p(x_{iy} \mid \theta_y)$. Estimation principle: select the parameters $\hat\theta_y$ which maximize the likelihood, that means $\hat\theta_y = \arg\max_{\theta_y} P\{X^y \mid \theta_y\}$. Procedure: find the extremum of the log-likelihood function,
$\frac{\partial}{\partial \theta_y} \log P\{X^y \mid \theta_y\} = \sum_{i \le n_y} \frac{\partial}{\partial \theta_y} \log p(x_{iy} \mid \theta_y) = 0.$
(A small sketch follows below.)
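A minimal sketch, not from the lecture: maximum likelihood for a one-dimensional Gaussian, maximizing the log-likelihood numerically and comparing with the closed-form estimators (sample mean and ML variance). Data, starting point and parametrization are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=200)

def neg_log_likelihood(params):
    mu, log_sigma2 = params                 # parametrize sigma^2 > 0 via its log
    sigma2 = np.exp(log_sigma2)
    return 0.5 * np.sum((x - mu) ** 2 / sigma2 + np.log(2 * np.pi * sigma2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_ml, sigma2_ml = res.x[0], np.exp(res.x[1])
print(mu_ml, sigma2_ml)              # numerical ML estimates
print(x.mean(), x.var(ddof=0))       # closed-form ML estimates agree
```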
• Remark. Bias of an estimator: $\mathrm{bias}(\hat\theta_n) = \mathbb{E}\{\hat\theta_n\} - \theta$. Consistent estimator: a point estimator $\hat\theta_n$ of a parameter $\theta$ is consistent if $\hat\theta_n \overset{P}{\to} \theta$. Asymptotic normality of maximum likelihood estimates: $(\hat\theta_n - \theta)/\sqrt{\mathbb{V}\{\hat\theta_n\}} \rightsquigarrow N(0, 1)$. Alternative to ML class-density estimation: discriminative learning by maximizing the a posteriori distribution $P\{\theta_y \mid X^y\}$ (details of the density do not have to be modelled, since they might not influence the posterior).
• Example: Multivariate Normal Distribution. Expectation values of a normal distribution and their estimation (the class index is omitted for legibility, $\theta_y \to \theta$):
$\log p(x_i \mid \theta) = -\tfrac12 (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) - \tfrac{d}{2} \log 2\pi - \tfrac12 \log |\Sigma|$
$\sum_{i \le n} \frac{\partial}{\partial \mu} \log p(x_i \mid \theta) = \sum_{i \le n} \Sigma^{-1} (x_i - \mu) = 0 \ \Rightarrow\ \hat\mu_n = \frac1n \sum_{i \le n} x_i \quad \text{(estimator for } \mu\text{)}$
The average-value formula results from the quadratic form. Unbiasedness: $\mathbb{E}[\hat\mu_n] = \frac1n \sum_{i \le n} \mathbb{E}[x_i] = \mathbb{E}[x] = \mu$.
• ML estimation of the variance (1-d case):
$\frac{\partial}{\partial \sigma^2} \sum_{i \le n} \log p(x_i \mid \theta) = \frac{\partial}{\partial \sigma^2} \sum_{i \le n} \left( -\frac{1}{2\sigma^2} (x_i - \mu)^2 - \frac12 \log(2\pi\sigma^2) \right) = \frac12 \sum_{i \le n} \sigma^{-4} (x_i - \mu)^2 - \frac{n}{2}\, \sigma^{-2} = 0 \ \Rightarrow\ \hat\sigma_n^2 = \frac1n \sum_{i \le n} (x_i - \mu)^2$
Multivariate case: $\hat\Sigma_n = \frac1n \sum_{i \le n} (x_i - \mu)(x_i - \mu)^T$. $\hat\Sigma_n$ is biased, i.e. $\mathbb{E}[\hat\Sigma_n] \ne \Sigma$, if $\mu$ is unknown and replaced by $\hat\mu_n$. (A small numerical check follows below.)
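A minimal numerical check, not from the lecture, that the ML covariance estimate computed with the estimated mean is biased: its expectation is (n−1)/n · Σ rather than Σ. The covariance, sample size and number of trials are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
n, trials = 10, 20000

acc = np.zeros_like(Sigma)
for _ in range(trials):
    x = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=n)
    mu_hat = x.mean(axis=0)
    acc += (x - mu_hat).T @ (x - mu_hat) / n    # ML estimate with estimated mean

print(acc / trials)            # close to (n-1)/n * Sigma, i.e. systematically too small
print((n - 1) / n * Sigma)
```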