Machine Learning

        Central Problem of Pattern Recognition:
        Supervised and Unsupervised Learning

                             Classification
                        Bayesian Decision Theory
                         Perceptrons and SVMs
                               Clustering



Visual Computing: Joachim M. Buhmann — Machine Learning   143/196
Machine Learning – What is the Challenge?
           Find optimal structure in data and validate it!
          Concept for Robust Data Analysis
[Diagram: Data (vectors, relations, images, ...) → Structure definition (costs, risk, ...) → Structure optimization (multiscale analysis, stochastic approximation) → Structure validation (statistical learning theory), with a feedback loop; regularization of statistical & computational complexity; quantization of the solution space (information / rate distortion theory).]




Visual Computing: Joachim M. Buhmann — Machine Learning                                                                      144/196
The Problem of Pattern Recognition

Machine Learning (as statistics) addresses a number of challenging inference problems in pattern recognition which span the range from statistical modeling to efficient algorithmics. Approximative methods which yield good performance on average are particularly important.

• Representation of objects. ⇒ Data representation

• What is a pattern? Definition/modeling of structure.

• Optimization: Search for preferred structures

• Validation: are the structures indeed in the data or are they
  explained by fluctuations?

Visual Computing: Joachim M. Buhmann — Machine Learning    145/196
Literature

• Richard O. Duda, Peter E. Hart & David G. Stork, Pattern Classification.
  Wiley & Sons (2001)

• Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of
  Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag (2001)

• Luc Devroye, László Györfi & Gábor Lugosi, A Probabilistic Theory of
  Pattern Recognition. Springer Verlag (1996)

• Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data.
  Springer Verlag (1983); The Nature of Statistical Learning Theory.
  Springer Verlag (1995)

• Larry Wasserman, All of Statistics. (1st ed. 2004, corr. 2nd printing,
  ISBN 0-387-40272-1) Springer Verlag (2004)


Visual Computing: Joachim M. Buhmann — Machine Learning                146/196
The Classification Problem




Visual Computing: Joachim M. Buhmann — Machine Learning   147/196
Visual Computing: Joachim M. Buhmann — Machine Learning   148/196
Classification as a Pattern Recognition Problem

Problem: We look for a partition of the object space O (fish
  in the previous example) which corresponds to classification
  examples.
   Distinguish conceptually between “objects” o ∈ O and “data” x ∈ X !

Data: pairs of feature vectors and class labels
  Z = {(xi, yi) : 1 ≤ i ≤ n, xi ∈ Rd, yi ∈ {1, . . . , k}}

Definitions: feature space X with xi ∈ X ⊂ Rd

                      class labels yi ∈ {1, . . . , k}

Classifier: mapping c : X → {1, . . . , k}

k class problem: What is yn+1 ∈ {1, . . . , k} for xn+1 ∈ Rd?
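
As a concrete illustration of such a mapping c : X → {1, . . . , k} (a sketch added here, not part of the original slides): a nearest-class-mean classifier fitted on labeled pairs (xi, yi). All names and the toy data below are illustrative assumptions.

```python
import numpy as np

class NearestMeanClassifier:
    """Minimal sketch of a classifier c: R^d -> {1, ..., k}."""

    def fit(self, X, y):
        # estimate one mean vector per class label
        self.labels_ = np.unique(y)
        self.means_ = np.array([X[y == lab].mean(axis=0) for lab in self.labels_])
        return self

    def predict(self, X):
        # assign each feature vector to the class with the closest mean
        d = ((X[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=2)
        return self.labels_[d.argmin(axis=1)]

# toy usage: n labeled samples in R^2, labels in {1, 2}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([1] * 20 + [2] * 20)
c = NearestMeanClassifier().fit(X, y)
print(c.predict(np.array([[0.1, -0.2], [2.9, 3.1]])))  # -> [1 2]
```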

Visual Computing: Joachim M. Buhmann — Machine Learning                  149/196
Example of Classification




Visual Computing: Joachim M. Buhmann — Machine Learning   150/196
Histograms of Length Values

[Figure 1.2 (Duda, Hart & Stork): Histograms of the length feature for the two categories (salmon, sea bass). No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* leads to the smallest number of errors on average.]

Visual Computing: Joachim M. Buhmann — Machine Learning   151/196
Histograms of Skin Brightness Values

[Figure 1.3 (Duda, Hart & Stork): Histograms of the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors.]

Visual Computing: Joachim M. Buhmann — Machine Learning   152/196
Linear Classification

[Figure 1.4 (Duda, Hart & Stork): The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier; the overall classification error is lower than with a single feature, but some errors remain.]

Visual Computing: Joachim M. Buhmann — Machine Learning   153/196
Overfitting
[Figure 1.5 (Duda, Hart & Stork): Overly complex models for the fish lead to complicated decision boundaries. While such a decision boundary may classify the training samples perfectly, it would lead to poor performance on future patterns.]

Visual Computing: Joachim M. Buhmann — Machine Learning   154/196
Optimized Non-Linear Classification
Occam's razor argument: Entia non sunt multiplicanda praeter necessitatem! (Entities must not be multiplied beyond necessity.)

[Figure 1.6 (Duda, Hart & Stork): The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of the classifier, thereby giving the highest accuracy on new patterns.]

Visual Computing: Joachim M. Buhmann — Machine Learning   155/196
Regression
                       (see Introduction to Machine Learning)


Question: Given a feature (vector) xi and a corresponding noisy measurement of a function value yi = f(xi) + noise, what is the unknown function f(·) in a hypothesis class H?


Data: Z = {(xi, yi) ∈ Rd × R : 1 ≤ i ≤ n}

Modeling choice: What is an adequate hypothesis class and
 a good noise model? Fitting with linear/nonlinear functions?
Visual Computing: Joachim M. Buhmann — Machine Learning         156/196
The Regression Function

Questions: (i) What is the statistically optimal estimate of a
 function f : Rd → R and (ii) which algorithm achieves this
 goal most efficiently?

Solution to (i): the regression function

                    y(x) = E{y | X = x} = ∫Ω y p(y | X = x) dy




                                                                   [Figure: Nonlinear regression of the sinc function sinc(x) := sin(x)/x (gray) with a regression fit (black) based on 50 noisy data points.]

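A small sketch (not from the slides) of how the regression function y(x) = E{y | X = x} can be approximated from 50 noisy samples of the sinc function; here a Nadaraya-Watson kernel estimator is used, and the bandwidth value is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-10, 10, 50)
y_train = np.sinc(x_train / np.pi) + rng.normal(0, 0.1, 50)   # np.sinc(t) = sin(pi t)/(pi t), so this is sin(x)/x + noise

def regression_fit(x_query, x_train, y_train, bandwidth=1.0):
    """Nadaraya-Watson estimate of E[y | X = x] with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

x_grid = np.linspace(-10, 10, 200)
y_hat = regression_fit(x_grid, x_train, y_train)   # smooth estimate of the regression function
```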
Visual Computing: Joachim M. Buhmann — Machine Learning                                 157/196
Examples of linear and nonlinear regression




               [Figures: linear regression and nonlinear regression]

               How should we measure the deviations?

               [Figures: vertical offsets vs. perpendicular offsets]


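To make the two error measures concrete, a sketch under simple assumptions (1-d inputs, straight-line fit; all names are illustrative): ordinary least squares minimizes vertical offsets, while a total-least-squares fit via the SVD minimizes perpendicular offsets.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 30)

# vertical offsets: ordinary least squares fit y ~ a*x + b
a_ols, b_ols = np.polyfit(x, y, deg=1)

# perpendicular offsets: total least squares via the smallest principal direction
pts = np.column_stack([x, y]) - np.array([x.mean(), y.mean()])
_, _, vt = np.linalg.svd(pts, full_matrices=False)
normal = vt[-1]                       # direction of smallest variance = normal of the fitted line
a_tls = -normal[0] / normal[1]        # slope of the fitted line (assumes the line is not vertical)
b_tls = y.mean() - a_tls * x.mean()
```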
Visual Computing: Joachim M. Buhmann — Machine Learning                           158/196
Core Questions of Pattern Recognition:
                  Unsupervised Learning

No teacher signal is available for the learning algorithm; learning is guided by a general cost/risk function.

Examples for unsupervised learning
  1. data clustering, vector quantization:
     as in classification we search for a partitioning of the objects into
     groups, but explicit labelings are not available (see the sketch below).
  2. hierarchical data analysis; search for tree structures in data
  3. visualisation, dimension reduction

Semisupervised learning: some of the data are labeled, most
  of them are unlabeled.
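
A minimal sketch of example 1 above (data clustering / vector quantization), assuming k roughly spherical groups: a few iterations of the standard k-means (Lloyd) algorithm. No labels are used; only a cost function (the quantization error) guides the grouping. The toy data are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations minimizing the quantization error sum_i ||x_i - mu_c(i)||^2."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial prototypes
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest prototype
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # update step: each prototype moves to the mean of its group
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign, centers

# unlabeled toy data: two blobs in R^2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
assign, centers = kmeans(X, k=2)
```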

Visual Computing: Joachim M. Buhmann — Machine Learning     159/196
Modes of Learning

Reinforcement Learning: weakly supervised learning
  Action chains are evaluated at the end.
  Example: Backgammon — the neural network TD-Gammon reached
  world-champion level play. Quite popular in robotics.

Active Learning: Data are selected according to their expected
  information gain.
  Information Filtering

Inductive Learning: the learning algorithm extracts logical rules
  from the data.
  Inductive Logic Programming is a popular subarea of Artificial
  Intelligence


Visual Computing: Joachim M. Buhmann — Machine Learning    160/196
Vectorial Data
[Figure: Data of 20 Gaussian sources in R20, projected onto two dimensions with Principal Component Analysis.]
Visual Computing: Joachim M. Buhmann — Machine Learning                                       161/196
Relational Data
 Pairwise dissimilarity of 145 globins which have been selected from 4 classes: α-globins, β-globins, myoglobins, and globins of insects and plants.




Visual Computing: Joachim M. Buhmann — Machine Learning   162/196
Scales for Data

Nominal or categorial scale: qualitative, but without quantitative
 measurements,
 e.g. binary scale F = {0, 1} (presence or absence of properties
 like “kosher”) or
 taste categories “sweet, sour, salty and bitter”.

Ordinal scale: measurement values are meaningful only with
  respect to other measurements, i.e., the rank order of measurements
  carries the information, not the numerical differences
  (e.g. information on the ranking of different marathon races!?)




Visual Computing: Joachim M. Buhmann — Machine Learning    163/196
Quantitative scale:
   • interval scale: the relation of numerical differences carries
     the information. Invariance w.r.t. translation and scaling
     (Fahrenheit scale of temperature).
   • ratio scale: the zero value of the scale carries information,
     but the measurement unit does not (Kelvin scale).
   • absolute scale: absolute values are meaningful (grades
     of final exams)




Visual Computing: Joachim M. Buhmann — Machine Learning     164/196
Machine Learning: Topic Chart

• Core problems of pattern recognition

• Bayesian decision theory

• Perceptrons and Support vector machines

• Data clustering




Visual Computing: Joachim M. Buhmann — Machine Learning   165/196
Bayesian Decision Theory
             The Problem of Statistical Decisions

Task: n objects have to be partitioned into the classes 1, . . . , k,
  the doubt class D and the outlier class O.
   D : doubt class (→ new measurements required)
   O : outlier class, definitively none of the classes 1, 2, . . . , k

Objects are characterized by feature vectors X ∈ X , X ∼
 P(X) with the probability P(X = x) of feature values x.

Statistical modeling: Objects represented by data X and
  classes Y are considered to be random variables, i.e.,
  (X, Y ) ∼ P(X, Y ).
   Conceptually, it is not mandatory to consider class labels as random since they might
   be induced by legal considerations or conventions.

Visual Computing: Joachim M. Buhmann — Machine Learning                           166/196
Structure of the feature space X
   • X ⊂ Rd
   • X = X1 × X2 × · · · × Xd with Xi ⊆ R or Xi finite.
   Remark: in most situations we can define the feature space as a subset of Rd or as
   tuples of real, categorial (B = {0, 1}) or ordinal numbers. Sometimes we have more
   complicated data spaces composed of lists, trees or graphs.


Class density / likelihood: py (x) := P(X = x|Y = y) is equal
  to the probability of a feature value x given a class y.

Parametric Statistics: estimate the parameters of the class
  densities py (x)

Non-Parametric Statistics: minimize the empirical risk

Visual Computing: Joachim M. Buhmann — Machine Learning                       167/196
Motivation of Classification
   Given are labeled data
     Z = {(xi, yi) : i ≤ n}

   Questions:
      1. What are the class boundaries?
      2. What are the class-specific densities py(x)?
      3. How many modes or parameters do we need to model py(x)?
      4. ...

      [Figure: quadratic SVM classifier for five classes. White areas are ambiguous regions.]

Visual Computing: Joachim M. Buhmann — Machine Learning                                    168/196
Thomas Bayes and his Terminology

The State of Nature is modelled as a random variable!




                                                prior:        P{model}
                                                likelihood:   P{data|model}
                                                posterior:    P{model|data}
                                                evidence:     P{data}



Bayes Rule:                P{model|data} = P{data|model} P{model} / P{data}

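A tiny numerical sketch of Bayes' rule (the numbers below are made up purely for illustration): two candidate models with priors and likelihoods for one observed data point.

```python
# hypothetical numbers: P{model}, P{data | model}
prior = {"model_1": 0.7, "model_2": 0.3}
likelihood = {"model_1": 0.2, "model_2": 0.6}

# evidence: P{data} = sum_m P{data | m} P{m}
evidence = sum(likelihood[m] * prior[m] for m in prior)

# posterior: P{model | data} = P{data | model} P{model} / P{data}
posterior = {m: likelihood[m] * prior[m] / evidence for m in prior}
print(posterior)   # {'model_1': 0.4375, 'model_2': 0.5625}
```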
Visual Computing: Joachim M. Buhmann — Machine Learning                       169/196
Ronald A. Fisher and Frequentism

Fisher, Ronald Aylmer (1890-1962): founder of frequentist
  statistics together with Jerzy Neyman & Karl Pearson.

                                           British mathematician and biologist who invented
                                              revolutionary techniques for applying
                                              statistics to the natural sciences.

                                           Maximum likelihood method

                                           Fisher information: a measure for the
                                              information content of densities.

                                           Sampling theory

                                           Hypothesis testing



Visual Computing: Joachim M. Buhmann — Machine Learning                          170/196
Bayesianism vs. Frequentist Inference¹
Bayesianism is the philosophical tenet that the mathematical theory of probability
  applies to the degree of plausibility of statements, or to the degree
  of belief of rational agents in the truth of statements; together with Bayes
  theorem, it becomes Bayesian inference. The Bayesian interpretation of
  probability allows probabilities to be assigned to random events, but also
  allows the assignment of probabilities to any other kind of statement.

Bayesians assign probabilities to any statement, even when no random
  process is involved, as a way to represent its plausibility. As such, the
  scope of Bayesian inquiries includes the scope of frequentist inquiries.

The limiting relative frequency of an event over a long series of trials is
  the conceptual foundation of the frequency interpretation of probability.

Frequentism rejects degree-of-belief interpretations of mathematical probability
  as in Bayesianism, and assigns probabilities only to random
  events according to their relative frequencies of occurrence.
   ¹
       see http://encyclopedia.thefreedictionary.com/

Visual Computing: Joachim M. Buhmann — Machine Learning                   171/196
Bayes Rule for Known Densities and Parameters

Assume that we know how the features are distributed for the
  different classes, i.e., the class conditional densities and their
  parameters are known. What is the best classification strategy
  in this situation?

Classifier:
                                  ĉ : X → {1, . . . , k, D}

   The assignment function ĉ maps the feature space X to the
   set of classes {1, . . . , k, D}. (Outliers are neglected.)

Quality of a classifier: Whenever a classifier returns a label
 which differs from the correct class Y = y then it has made
 a mistake.
Visual Computing: Joachim M. Buhmann — Machine Learning        172/196
Error count: The sum of indicator functions

                                            Σx∈X I{ĉ(x) ≠ y}

   counts the classifier mistakes. Note that this error count is a
   random variable!

Expected errors, also called expected risk, define the quality
  of a classifier:

         R(ĉ) = Σy≤k P(y) EP(x)[ I{ĉ(x) ≠ y} | Y = y ] + terms from D


Remark: The rationale behind this choice comes from gambling. If we bet on
  a particular outcome of our experiment and our gain is measured by how
  often we assign the measurements to the correct class, then the classifier with
  minimal expected risk will win on average against any other classification
  rule (“Dutch books”)!

Visual Computing: Joachim M. Buhmann — Machine Learning               173/196
The Loss Function

Weighted mistakes are introduced when classification errors
 are not equally costly; e.g. in medical diagnosis, some disease
 classes might be harmless and others might be lethal
 despite similar symptoms.

⇒ We introduce a loss function L(y, z) which denotes the loss
 for the decision z if class y is correct.

0-1 loss: all classes are treated the same!
                    
                     ⎧ 0   if z = y (correct decision)
        L0−1(y, z) = ⎨ 1   if z ≠ y and z ≠ D (wrong decision)
                     ⎩ d   if z = D (no decision)

Visual Computing: Joachim M. Buhmann — Machine Learning     174/196
• weighted classification costs L(y, z) ∈ R+ are frequently
  used, e.g. in medicine;
  classification costs can also be asymmetric, that means
  L(y, z) ≠ L(z, y), e.g. (z, y) ∼ (pancreas cancer, gastritis).

Conditional Risk function of the classifier is the expected
 loss for class y:

       R(ĉ, y) = Ex[ L(y, ĉ(x)) | Y = y ]

               = Σz≤k L(y, z) P{ĉ(x) = z | Y = y} + L(y, D) P{ĉ(x) = D | Y = y}

               = P{ĉ(x) ≠ y ∧ ĉ(x) ≠ D | Y = y} + d · P{ĉ(x) = D | Y = y}     (0-1 loss)
                 = pmc(y) + d · pd(y),

   where pmc(y) is the probability of misclassification and pd(y) the probability of doubt.

Visual Computing: Joachim M. Buhmann — Machine Learning                                175/196
Total risk of the classifier: (πy := P(Y = y))

       R(ĉ) = Σz≤k πz pmc(z) + d Σz≤k πz pd(z) = EC[ R(ĉ, C) ]


Asymptotic average loss

       limn→∞ (1/n) Σj≤n L(cj, ĉ(xj)) = limn→∞ R̂(ĉ) = R(ĉ),

   where {(xj, cj) | 1 ≤ j ≤ n} is a random sample set of size n.
   This formula can be interpreted as the expected loss with the empirical distribution as probability model.

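A short sketch (not from the slides) of this empirical average loss (1/n) Σj L(cj, ĉ(xj)); the loss table, doubt cost d, and the hard-coded predictions stand in for a real classifier and are purely illustrative.

```python
import numpy as np

def empirical_risk(y_true, y_pred, loss):
    """Average loss (1/n) * sum_j L(y_j, c_hat(x_j)) over a labeled sample."""
    return np.mean([loss(y, z) for y, z in zip(y_true, y_pred)])

# 0-1 loss with doubt class "D" and doubt cost d
d = 0.3
def loss_01(y, z):
    if z == "D":
        return d          # no decision
    return 0.0 if z == y else 1.0

y_true = [1, 2, 2, 1, 3]
y_pred = [1, 2, "D", 3, 3]                       # output of some classifier c_hat
print(empirical_risk(y_true, y_pred, loss_01))   # (0 + 0 + 0.3 + 1 + 0)/5 = 0.26
```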



Visual Computing: Joachim M. Buhmann — Machine Learning                                               176/196
Posterior class probability

Posterior: Let

               p(y|x) ≡ P{Y = y | X = x} = πy py(x) / Σz πz pz(x)

    be the posterior of the class y given X = x.
   (The “partition of one” πy py(x) / Σz πz pz(x) results from the normalization Σz p(z|x) = 1.)

Likelihood: The class conditional density py(x) is the probability
  of observing data X = x given class Y = y.

Prior: πy is the probability of class Y = y.


Visual Computing: Joachim M. Buhmann — Machine Learning                                    177/196
Bayes Optimal Classifier
Theorem 1 The classification rule which minimizes the total
risk for 0 − 1 loss is

           ĉ(x) =  y   if p(y|x) = maxz≤k p(z|x) > 1 − d,
                   D   if p(y|x) ≤ 1 − d   for all y.

Generalization to arbitrary loss functions:

           ĉ(x) =  y   if Σz L(z, y) p(z|x) = minρ≤k Σz L(z, ρ) p(z|x) ≤ d,
                   D   otherwise.


Bayes classifier: Select the class with highest πy py (x) value if
  it exceeds the costs for not making a decision, i.e., πy py (x) >
  (1 − d)p(x).
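
A minimal sketch of the rule in Theorem 1 (0-1 loss with doubt cost d), assuming the posteriors p(y|x) have already been computed and are passed in as an array; the example numbers are illustrative.

```python
import numpy as np

def bayes_classify(posterior, d):
    """posterior: array of p(y|x) for y = 1..k; returns a class label or 'D' (doubt)."""
    y = int(np.argmax(posterior)) + 1          # class with maximal posterior
    if posterior[y - 1] > 1.0 - d:
        return y
    return "D"                                 # no posterior exceeds 1 - d: defer the decision

# example: three classes, doubt cost d = 0.2
print(bayes_classify(np.array([0.85, 0.10, 0.05]), d=0.2))  # -> 1
print(bayes_classify(np.array([0.40, 0.35, 0.25]), d=0.2))  # -> 'D'
```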

Visual Computing: Joachim M. Buhmann — Machine Learning                                 178/196
Proof: Calculate the total expected loss R(ĉ):

    R(ĉ) = EX[ EY[ L0−1(Y, ĉ(x)) | X = x ] ]

         = ∫X EY[ L0−1(Y, ĉ(x)) | X = x ] p(x) dx    with   p(x) = Σz≤k πz pz(x)

   Minimize the conditional expectation value since it depends only on ĉ:

        ĉ(x) = argminc̃∈{1,...,k,D} E[ L0−1(Y, c̃) | X = x ]

             = argminc̃∈{1,...,k,D} Σz≤k L0−1(z, c̃) p(z|x)

             = argminc̃∈{1,...,k} (1 − p(c̃|x))   if d > minc̃ (1 − p(c̃|x)),   else D

             = argmaxc̃∈{1,...,k} p(c̃|x)         if 1 − d < maxc̃ p(c̃|x),     else D


Visual Computing: Joachim M. Buhmann — Machine Learning                             179/196
Outliers

• Modeling by an outlier class πO with pO (x)

• “Novelty Detection”: Classify a measurement as an outlier
  if
          πO pO(x) ≥ max{ (1 − d) p(x),  maxz πz pz(x) }

• The outlier concept causes conceptual problems and it does not fit into
  statistical decision theory since outliers indicate an erroneous or incomplete
  specification of the statistical model!

• The outlier class is often modeled by a uniform distribution.
  Attention: the uniform distribution cannot be normalized on many
  feature spaces!
   =⇒ Limit the support of the measurement space or put a (Gaussian)
   measure on it!

Visual Computing: Joachim M. Buhmann — Machine Learning                180/196
Class Conditional Densities and Posteriors for 2
                           Classes
           Left: class-conditional probability density functions. Right: posterior probabilities for priors P(y1) = 2/3, P(y2) = 1/3.

[Figure 2.1 (Duda, Hart & Stork): Hypothetical class-conditional probability density functions p(x|ωi) show the probability density of measuring a particular feature value x given that the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0.]

[Figure 2.2 (Duda, Hart & Stork): Posterior probabilities P(ωi|x) for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. In this case, given that a pattern is measured to have feature value x = 14, the probability that it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0.]

         Visual Computing: Joachim M. Buhmann — Machine Learning                                   181/196
Likelihood Ratio for 2 Class Example

[Figure 2.3 (Duda, Hart & Stork): The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one classification loss, the decision boundaries are determined by the threshold θa; a loss that penalizes one type of error more strongly yields the larger threshold θb. The thresholds partition the feature axis into the decision regions R1 and R2.]

      Visual Computing: Joachim M. Buhmann — Machine Learning                               182/196
Discriminant Functions gl
  [Figure 2.5 (Duda, Hart & Stork): The functional structure of a general statistical pattern classifier with d inputs x1, . . . , xd and c discriminant functions g1(x), . . . , gc(x). A subsequent step determines which of the discriminant values is the maximum and categorizes the input pattern accordingly (action, e.g. classification, with associated costs).]

• Discriminant function: gy(x) = P{Y = y | X = x}

• Class decision: gy(x) > gz(x) ∀z ≠ y ⇒ class y.

• Different discriminant functions can yield the same decision:
   g̃y(x) = log P{x|y} + log πy; minimize implementation problems!

Visual Computing: Joachim M. Buhmann — Machine Learning                                  183/196
Example for Discriminant Functions


[Figure 2.6 (Duda, Hart & Stork): In this two-dimensional two-category classifier, the probability densities p(x|ωi)P(ωi) are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R1 is not simply connected.]

    Visual Computing: Joachim M. Buhmann — Machine Learning                               184/196
Adaptation of Discriminant Functions gl
  [Diagram: the pattern classifier structure of Figure 2.5 extended by a teacher signal; the discriminant values g1(x), g2(x), . . . , gc(x) computed from the inputs x1, . . . , xd feed into a MAX unit that selects the action (e.g., classification), and the deviation from the teacher signal drives the adaptation.]

The red connections (weights) are adapted in such a way that the teacher
signal is imitated by the discriminant function.

Visual Computing: Joachim M. Buhmann — Machine Learning                                     185/196
Example Discriminant Functions: Normal
                    Distributions
The Likelihood of class y is Gaussian distributed.

         py(x) = 1/√((2π)^d |Σy|) · exp( −½ (x − µy)T Σy⁻¹ (x − µy) )

 Special case: Σy = σ²I

                gy(x) = log py(x) + log πy
                      = −(1/(2σ²)) ‖x − µy‖² + log πy + const.



Visual Computing: Joachim M. Buhmann — Machine Learning              186/196
⇒ Decision surface between class z and y:

      −(1/(2σ²)) ‖x − µz‖² + log πz  =  −(1/(2σ²)) ‖x − µy‖² + log πy

      −‖x‖² + 2x·µz − ‖µz‖² + 2σ² log πz  =  −‖x‖² + 2x·µy − ‖µy‖² + 2σ² log πy

    ⇒      2x·(µz − µy) − ‖µz‖² + ‖µy‖² + 2σ² log(πz/πy) = 0


Linear decision rule:                           wT(x − x0) = 0

with      w = µz − µy,       x0 = ½(µz + µy) − (σ²(µz − µy) / ‖µz − µy‖²) · log(πz/πy)

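A small numerical sketch of this linear decision rule wT(x − x0) = 0; the means, variance, and priors below are arbitrary illustrative values, not from the slides.

```python
import numpy as np

def linear_boundary(mu_z, mu_y, sigma2, pi_z, pi_y):
    """w and x0 of the decision surface between two Gaussian classes with Sigma = sigma^2 * I."""
    w = mu_z - mu_y
    x0 = 0.5 * (mu_z + mu_y) - sigma2 * w / np.dot(w, w) * np.log(pi_z / pi_y)
    return w, x0

mu_z, mu_y = np.array([2.0, 0.0]), np.array([0.0, 0.0])
w, x0 = linear_boundary(mu_z, mu_y, sigma2=1.0, pi_z=0.7, pi_y=0.3)

x = np.array([0.9, 0.4])
decision = "z" if np.dot(w, x - x0) > 0 else "y"   # the sign of w^T (x - x0) picks the class
```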

Visual Computing: Joachim M. Buhmann — Machine Learning                              187/196
Decision Surface for Gaussians in 1,2,3
                       Dimensions





  FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity
  matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of
  d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional
  examples, we indicate p(x|ωi ) and the boundaries for the case P (ω1 ) = P (ω2 ). In the three-dimensional case,
  the grid plane separates R1 from R2 . From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern
  Classification. Copyright c 2001 by John Wiley & Sons, Inc.




Visual Computing: Joachim M. Buhmann — Machine Learning                                                                                                                      188/196
                        FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently
                        disparate priors the boundary will not lie between the means of these one-, two- and
                        three-dimensional spherical Gaussian distributions. From: Richard O. Duda, Peter E.
                        Hart, and David G. Stork, Pattern Classification. Copyright c 2001 by John Wiley &
                        Sons, Inc.
Visual Computing: Joachim M. Buhmann — Machine Learning                                                                                                                                                                       189/196
Multi Class Case




[Figure 2.16 (Duda, Hart & Stork): The decision regions R1, . . . , R4 for four normal distributions. Even with such a small number of categories, the shapes of the boundary regions can be rather complex.]

Decision regions for four Gaussian distributions: even for such a small number of classes the discriminant functions show a complex form.
    Visual Computing: Joachim M. Buhmann — Machine Learning         190/196
Example: Gene Expression Data
The expression of genes is measured for various patients. The
expression profiles provide information about the metabolic state of
the cells, meaning that they could be used as indicators for disease
classes. Each patient is represented as a vector in a high
dimensional (≈ 10000) space with Gaussian class distribution.
[Figure: gene expression matrix (samples × genes), with predicted (Pred) and true class labels for ALL B-Cell, ALL T-Cell and AML samples.]
Visual Computing: Joachim M. Buhmann — Machine Learning          191/196
Parametric Models for Class Densities
If we knew the prior probabilities and the class conditional
probabilities, then we could calculate the optimal classifier.
But we don't!

Task: Estimate p(y|x; θ) from samples Z = {(x1, y1), . . . , (xn, yn)}
  for classification.

Data are sorted according to their classes:
  Xy = {X1y , . . . , Xny ,y } where Xiy ∼ P{X|Y = y; θy }

Question: How can we use the information in the samples to estimate θy?

Assumption: classes can be separated and treated independently!
  Xy is not informative w.r.t. θz, z ≠ y

Visual Computing: Joachim M. Buhmann — Machine Learning       192/196
Maximum Likelihood Estimation Theory

Likelihood of the data set:                        P{Xy|θy} = ∏i≤ny p(xiy|θy)

Estimation principle: Select the parameters θ̂y which maximize
  the likelihood, that means

                                    θ̂y = arg maxθy P{Xy|θy}


Procedure: Find the extreme value of the log-likelihood function

                                    ∇θy log P{Xy|θy} = 0

                                    Σi≤n ∂/∂θy log p(xi|θy) = 0

Visual Computing: Joachim M. Buhmann — Machine Learning                                193/196
Remark

Bias of an estimator:                           bias(θ̂n) = E{θ̂n} − θ.

Consistent estimator: A point estimator θ̂n of a parameter θ
  is consistent if θ̂n → θ in probability.

Asymptotic Normality of Maximum Likelihood estimates:
  (θ̂n − θ) / √(V{θ̂n}) → N(0, 1) in distribution.

Alternative to ML class density estimation: discriminative
  learning by maximizing the a posteriori distribution P{θy|Xy}
   (details of the density do not have to be modelled since they might not influence the
   posterior)




Visual Computing: Joachim M. Buhmann — Machine Learning                              194/196
Example: Multivariate Normal Distribution

Expectation values of a normal distribution and its estimation:
   Class index has been omitted for legibility reasons (θy → θ).

                 log p(xi|θ) = −½ (xi − µ)T Σ⁻¹ (xi − µ) − (d/2) log 2π − ½ log |Σ|

       Σi≤n ∂/∂µ log p(xi|θ) = ½ Σi≤n Σ⁻¹(xi − µ) + ½ Σi≤n ((xi − µ)T Σ⁻¹)T = 0

         ⇒   Σ⁻¹ Σi≤n (xi − µ) = 0      ⇒      µ̂n = (1/n) Σi xi     estimator for µ


The average value formula results from the quadratic form.

Unbiasedness:   E[µ̂n] = (1/n) Σi≤n E[xi] = E[x] = µ


Visual Computing: Joachim M. Buhmann — Machine Learning                                     195/196
ML estimation of the variance (1d case)

       ∂/∂σ² Σi≤n log p(xi|θ) = −∂/∂σ² [ (1/(2σ²)) Σi≤n ‖xi − µ‖² + (n/2) log(2πσ²) ]

                              = ½ Σi≤n σ⁻⁴ ‖xi − µ‖² − (n/2) σ⁻² = 0

                     ⇒   σ̂n² = (1/n) Σi≤n ‖xi − µ‖²


Multivariate case:    Σ̂n = (1/n) Σi≤n (xi − µ)(xi − µ)T


Σ̂n is biased, i.e., E[Σ̂n] ≠ Σ, if µ is unknown.


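A short sketch of these maximum likelihood estimators on a simulated sample (the simulated mean and covariance below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=500)
n = X.shape[0]

mu_hat = X.mean(axis=0)                          # (1/n) * sum_i x_i
centered = X - mu_hat
Sigma_ml = centered.T @ centered / n             # biased ML estimate (divides by n)
Sigma_unbiased = centered.T @ centered / (n - 1) # standard bias correction when mu is estimated
```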

Visual Computing: Joachim M. Buhmann — Machine Learning                                     196/196
