Single-Layer Perceptron Classifiers

     Berlin Chen, 2002
Outline
• Foundations of trainable decision-making
  networks to be formulated
  – Input space to output space (classification space)
• Focus on the classification of linearly separable
  classes of patterns
  – Linear discriminant functions and a simple correction
    function
  – Continuous error function minimization
• Explanation and justification of perceptron and
  delta training rules

                                                            2
Classification Model, Features,
          and Decision Regions
• A pattern is the quantitative description of an
  object, event, or phenomenon
   – Spatial patterns: weather maps, fingerprints …
   – Temporal patterns: speech signals …


• Pattern classification/recognition
   – Assign the input data (a physical object, event, or
     phenomenon) to one of the pre-specified classes
     (categories)
   – Discriminate the input data within object population
     via the search for invariant attributes among
     members of the population                              3
Classification Model, Features,
             and Decision Regions (cont.)
   • The block diagram of the recognition and
     classification system



                                          Dimension
                                          reduction
A neural network
for classification
 and for feature
    extraction



                                                      4
Classification Model, Features,
      and Decision Regions (cont.)
• More about Feature Extraction
  – Compressed data extracted from the input patterns that
    preserves the salient information
  – E.g.
     • Speech vowel sounds analyzed in 16-channel filterbanks can
       provide 16 spectral vectors, which can be further transformed
       into two dimensions
         – Tone height (high-low) and retraction (front-back)

     • Input patterns to be projected and reduced to lower
       dimensions



                                                                       5
Classification Model, Features,
      and Decision Regions (cont.)
• More about Feature Extraction
   [Figure: feature extraction as a projection of the original coordinates (x, y)
    onto rotated, lower-dimensional feature axes (x′, y′)]
                                                6
Classification Model, Features,
       and Decision Regions (cont.)
• Two simple ways to generate the pattern vectors for
  cases of spatial and temporal objects to be classified




• A pattern classifier maps input patterns (vectors) in En
  space into class numbers (in E1) that specify the membership:
                  i0(x) = j,   j = 1, 2, ..., R
                                                             7
Classification Model, Features,
       and Decision Regions (cont.)
• Classification described in geometric terms



                                                The decision surfaces here
                                                are curved lines


                                     i0(x) = j,  for all x ∈ Xj,   j = 1, 2, ..., R




   – Decision regions
   – Decision surfaces: generally, the decision surfaces for n-
     dimensional patterns may be (n-1)-dimensional hyper-surfaces                           8
Discriminant Functions
  • Determine the membership in a category by the
    classifier based on the comparison of R
    discriminant functions g1(x), g2(x),…, gR(x)
      – x belongs to the decision region Xk, i.e., i0(x) = k, if gk(x) has the largest
        value:   gk(x) > gj(x)   for all j = 1, 2, ..., R,  j ≠ k
   [Block diagram: the input pattern x = (x1, ..., xn) feeds R discriminators that
    compute g1(x), g2(x), ..., gR(x); a maximum selector outputs the class index i0(x).
    The training patterns are x1, x2, ..., xp, ..., xP with P >> n.
    The classifier is assumed here to have been designed already.]
                                                                                     9
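As a concrete illustration of this maximum-selector rule, here is a minimal Python sketch (mine, not part of the original slides); the three linear discriminant functions are made-up placeholders, and any set of gi could be plugged in.

```python
import numpy as np

def classify(x, discriminants):
    """Maximum selector: i0(x) = argmax_i g_i(x), returned as a 1-based class index."""
    return int(np.argmax([g(x) for g in discriminants])) + 1

# Hypothetical linear discriminants g_i(x) = w_i^t x + w_{i,n+1}
g = [lambda x: np.dot([ 1.0,  0.0], x) + 0.0,   # g1
     lambda x: np.dot([-1.0,  1.0], x) - 1.0,   # g2
     lambda x: np.dot([ 0.0, -1.0], x) - 2.0]   # g3

print(classify(np.array([2.0, 0.5]), g))  # g values are [2.0, -2.5, -2.5] -> class 1
```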
Discriminant Functions (cont.)

• Example 3.1:   Decision surface equation:   g(x) = g1(x) − g2(x) = −2x1 + x2 + 2
                                              g(x) > 0: class 1,   g(x) < 0: class 2


                                      The decision surface does
                                        not uniquely specify the
                                         discriminant functions


                                A classifier that assigns patterns to one of two
                                classes or categories is called a “dichotomizer”
                                (“dicho-” = “two”, “-tomize” = “cut”)
                                                                               10
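For Example 3.1 the decision rule amounts to checking the sign of g(x) = −2x1 + x2 + 2; a minimal sketch (mine, not from the slides):

```python
def dichotomize(x1, x2):
    """Example 3.1 decision rule: the sign of g(x) = -2*x1 + x2 + 2."""
    g = -2.0 * x1 + x2 + 2.0
    if g > 0:
        return 1          # class 1
    if g < 0:
        return 2          # class 2
    return None           # x lies on the decision surface -2*x1 + x2 + 2 = 0

print(dichotomize(0.0, 0.0))   # g = 2  -> class 1
print(dichotomize(3.0, 0.0))   # g = -4 -> class 2
```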
Discriminant Functions (cont.)




                                 11
Discriminant Functions (cont.)
Solution 1:
  (x − 0, y + 2, g1 − 1) · (2, −1, 1) = 0  ⇒  2x − y − 2 + g1 − 1 = 0  ⇒  g1 = −2x + y + 3,
      i.e.  g1(x) = [−2  1] [x1  x2]ᵗ + 3
  (x − 0, y + 2, g2 − 1) · (−2, 1, 1) = 0  ⇒  −2x + y + 2 + g2 − 1 = 0  ⇒  g2 = 2x − y − 1,
      i.e.  g2(x) = [2  −1] [x1  x2]ᵗ − 1
  g = g1 − g2 = 0  ⇒  −4x + 2y + 4 = 0  ⇒  −2x + y + 2 = 0

Solution 2:
  (x − 0, y + 2, g1 − 1) · (2, −1, 2) = 0  ⇒  2x − y − 2 + 2g1 − 2 = 0  ⇒  g1 = −x + (1/2)y + 2
  (x − 0, y + 2, g2 − 1) · (−2, 1, 2) = 0  ⇒  −2x + y + 2 + 2g2 − 2 = 0  ⇒  g2 = x − (1/2)y
  g = g1 − g2 = 0  ⇒  −2x + y + 2 = 0

  [Figure: the two discriminant planes and their normal vectors in (x1, x2, g) space]

                                An infinite number of
                            discriminant functions will yield
                                 correct classification                                           12
Discriminant Functions (cont.)
Multi-class




Two-class




              g( x) = g1 ( x) − g2 ( x)      g ( x ) > 0 : class 1
                                             g ( x ) < 0 : class 2
              subtraction                 Sign examination           13
Discriminant Functions (cont.)




                     The design of the discriminator
                     for this case is not straightforward;
                     the discriminant functions may turn out
                     to be nonlinear functions of x1 and x2.




                                                   14
Bayes’ Decision Theory
• Decision-making based on both the posterior knowledge
  obtained from specific observation data and the prior
  knowledge of the categories
   – Prior class probabilities P(ωi), ∀ class i
   – Class-conditional probabilities P(x|ωi), ∀ class i

      k = arg max_i P(ωi|x) = arg max_i P(x|ωi)P(ωi) / P(x)
        = arg max_i P(x|ωi)P(ωi) / Σj P(x|ωj)P(ωj)

      k = arg max_i P(ωi|x) = arg max_i P(x|ωi)P(ωi)      (P(x) is common to all classes)




                                                                                                15
Bayes’ Decision Theory (cont.)
• Bayes’ decision rule is designed to minimize the
  overall risk involved in making a decision
  – The expected loss (conditional risk) when making
    decision δi:

      R(δi | x) = Σj l(δi | ωj, x) P(ωj | x),   where l(δi | ωj, x) = 0 if i = j, 1 if i ≠ j
                = Σ_{j ≠ i} P(ωj | x)
                = 1 − P(ωi | x)

      • The overall risk (Bayes’ risk)

        R = ∫ R(δ(x) | x) p(x) dx   (integral over the whole pattern space),
            δ(x): the selected decision for a sample x

  – Minimize the overall risk (classification error) by computing the conditional risks
    and selecting the decision δi for which the conditional risk R(δi | x) is minimum,
    i.e., for which P(ωi | x) is maximum (the minimum-error-rate decision rule)            16
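A minimal sketch of the minimum-error-rate rule: choose k = argmax_i P(x|ωi)P(ωi). The one-dimensional Gaussian class-conditional densities and the priors below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density, used here as an illustrative class-conditional model."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_decide(x, priors, likelihoods):
    """Minimum-error-rate rule: k = argmax_i P(x|w_i) * P(w_i)."""
    scores = [lik(x) * p for lik, p in zip(likelihoods, priors)]
    return int(np.argmax(scores)) + 1   # 1-based class index

# Hypothetical two-class problem: w1 ~ N(0, 1), w2 ~ N(2, 1), equal priors
priors = [0.5, 0.5]
likelihoods = [lambda x: gauss_pdf(x, 0.0, 1.0),
               lambda x: gauss_pdf(x, 2.0, 1.0)]

print(bayes_decide(0.2, priors, likelihoods))  # -> 1 (x is closer to the mean 0)
print(bayes_decide(1.8, priors, likelihoods))  # -> 2
```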
Bayes’ Decision Theory (cont.)
   • Two-class pattern classification
      g1(x) = P(ω1|x) ≅ P(x|ω1)P(ω1),      g2(x) = P(ω2|x) ≅ P(x|ω2)P(ω2)

  Bayes’ classifier:
      decide ω1 if P(x|ω1)P(ω1) > P(x|ω2)P(ω2);  decide ω2 if P(x|ω1)P(ω1) < P(x|ω2)P(ω2)

  Likelihood ratio / log-likelihood ratio:
      l(x) = P(x|ω1) / P(x|ω2):        l(x) > P(ω2)/P(ω1) ⇒ ω1,   l(x) < P(ω2)/P(ω1) ⇒ ω2
      log l(x) = log P(x|ω1) − log P(x|ω2):
                 log l(x) > log P(ω2) − log P(ω1) ⇒ ω1,   log l(x) < log P(ω2) − log P(ω1) ⇒ ω2

  Classification error:
      p(error) = P(x ∈ R1, ω2) + P(x ∈ R2, ω1)
               = P(x ∈ R1|ω2)P(ω2) + P(x ∈ R2|ω1)P(ω1)
               = ∫_{R1} P(x|ω2)P(ω2) dx + ∫_{R2} P(x|ω1)P(ω1) dx
                                                                                                                    17
Bayes’ Decision Theory (cont.)
• When the environment is multivariate Gaussian,
  the Bayes’ classifier reduces to a linear classifier
    – The same form taken by the perceptron
    – But the linear nature of the perceptron is not
      contingent on the assumption of Gaussianity

      P(x|ω) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp[ −(1/2) (x − µ)ᵗ Σ⁻¹ (x − µ) ]

  Assumptions:
      Class ω1:  E[X] = µ1,   E[(X − µ1)(X − µ1)ᵗ] = Σ
      Class ω2:  E[X] = µ2,   E[(X − µ2)(X − µ2)ᵗ] = Σ
      P(ω1) = P(ω2) = 1/2
                                                                                               18
Bayes’ Decision Theory (cont.)

• When the environment is Gaussian, the Bayes’
  classifier reduces to a linear classifier (cont.)
     log l(x) = log P(x|ω1) − log P(x|ω2)
              = −(1/2)(x − µ1)ᵗ Σ⁻¹ (x − µ1) + (1/2)(x − µ2)ᵗ Σ⁻¹ (x − µ2)
              = (µ1 − µ2)ᵗ Σ⁻¹ x + (1/2)(µ2ᵗ Σ⁻¹ µ2 − µ1ᵗ Σ⁻¹ µ1)
              = w x + b

     ∴ log l(x) = w x + b:    > 0 ⇒ ω1,    < 0 ⇒ ω2
                                                                                19
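A small numpy sketch of this reduction (illustrative means and covariance, equal priors assumed, so the threshold is 0): w = (µ1 − µ2)ᵗΣ⁻¹ and b = ½(µ2ᵗΣ⁻¹µ2 − µ1ᵗΣ⁻¹µ1), and the decision is the sign of wx + b.

```python
import numpy as np

def gaussian_bayes_linear(mu1, mu2, sigma):
    """Weights of the linear Bayes classifier for equal-covariance Gaussians."""
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu1 - mu2)      # (mu1 - mu2)^t Sigma^-1, stored as a 1-D array
    b = 0.5 * (mu2 @ sigma_inv @ mu2 - mu1 @ sigma_inv @ mu1)
    return w, b

# Illustrative parameters (equal priors assumed)
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
sigma = np.eye(2)

w, b = gaussian_bayes_linear(mu1, mu2, sigma)

x = np.array([0.5, 0.2])
decision = 1 if w @ x + b > 0 else 2   # log l(x) > 0 -> class w1
print(w, b, decision)                  # w = [2. 2.], b = 0.0, decision = 1
```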
Bayes’ Decision Theory (cont.)

• Multi-class pattern classification




                                       20
Linear Machine and Minimum Distance
             Classification
• Find the linear-form discriminant function for two-
  class classification when the class prototypes are
  known

• Example 3.1: Select the decision hyperplane that
  contains the midpoint of the line segment
  connecting the center points of the two classes




                                                        21
Linear Machine and Minimum Distance
        Classification (cont.)
The dichotomizer’s discriminant function g(x):
      (x1 − x2)ᵗ (x − (x1 + x2)/2) = 0
      (x1 − x2)ᵗ x + (1/2)(‖x2‖² − ‖x1‖²) = 0

      Taken as   [wᵗ  wn+1] · [x; 1] = 0,   where
         w = x1 − x2,
         wn+1 = (1/2)(‖x2‖² − ‖x1‖²),
      and [x; 1] is the augmented input pattern.

      [Figure: the decision hyperplane passes through the midpoint (x1 + x2)/2 of the
       segment joining the two prototype points]

       It is a simple minimum-distance classifier.
                                                                                            22
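A small numpy sketch (not from the slides) of this construction: the augmented weights are built directly from the two prototype points. The pair x1 = [2, 5], x2 = [−1, −3] used here also appears in part (c) of the worked problem a few slides below, giving g(x) = 3x1 + 8x2 − 9.5.

```python
import numpy as np

def minimum_distance_dichotomizer(x1, x2):
    """Weights of g(x) = w^t x + w_{n+1} built from two class prototypes."""
    w = x1 - x2
    w_n1 = 0.5 * (x2 @ x2 - x1 @ x1)
    return w, w_n1

x1 = np.array([2.0, 5.0])     # class 1 prototype
x2 = np.array([-1.0, -3.0])   # class 2 prototype

w, w_n1 = minimum_distance_dichotomizer(x1, x2)
print(w, w_n1)                # [3. 8.], -9.5  ->  g(x) = 3*x1 + 8*x2 - 9.5

x = np.array([1.0, 1.0])
print(1 if w @ x + w_n1 > 0 else 2)   # g = 1.5 > 0 -> class 1 (x is nearer to x1)
```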
Linear Machine and Minimum Distance
        Classification (cont.)
• The linear-form discriminant functions for multi-
  class classification
   – There are up to R(R-1)/2 decision hyperplanes for R
     pairwise separable classes
                                      Some classes may not be contiguous

   [Figure: two examples of three pairwise linearly separable classes (marked x, o, Δ);
    in the second example the decision regions of some classes are not contiguous]
                                                                           23
Linear Machine and Minimum Distance
        Classification (cont.)
• Linear machine or minimum-distance classifier
  – Assume the class prototypes are known for all classes
         • Euclidean distance between input pattern x and the center of
           class i, xi:

               ‖x − xi‖ = [ (x − xi)ᵗ (x − xi) ]^(1/2)

         • Minimizing ‖x − xi‖² = xᵗx − 2xiᵗx + xiᵗxi   (the term xᵗx is the same for all classes)
           is equal to maximizing   xiᵗx − (1/2)xiᵗxi

   – Set the discriminant function for each class i to be:

               gi(x) = xiᵗ x − (1/2) xiᵗ xi,      i.e.   gi(x) = wiᵗ y

               gi(x) = [wiᵗ  wi,n+1] · [x; 1],    where  wi = xi,   wi,n+1 = −(1/2) xiᵗ xi
                                                                                              24
Linear Machine and Minimum Distance
        Classification (cont.)



                          This approach is also called
                           correlation classification




                             A 1 is appended as the (n+1)-th component
                             of the input pattern:

                                 gi(x) = xiᵗ x − (1/2) xiᵗ xi,    i.e.   gi(x) = wiᵗ y
                                                                   25
Linear Machine and Minimum Distance
        Classification (cont.)
• Example 3.2
          10                     2              -5     
  w1   =     2 , w       =     − 5 , w       =     5   
                       2                     3
          − 52              − 14 . 5            − 25   
                                                       


  g 1 ( x ) = 10 x 1 + 2 x 2 − 52
  g   (x ) =
       2          2 x 1 − 5 x 2 − 14 . 5
  g 3 (x ) =      − 5 x 1 + 5 x 2 − 25
                                                                              S 12
                                                                S 13
   S 12 : 8 x1 + 7 x 2 − 37 . 5 = 0
   S 13 : − 15 x1 + 3 x 2 + 27 = 0
   S 23 : − 7 x1 + 10 x 2 − 10 . 5 = 0


                            1
           gi ( x) = xit x − xit xi
                            2
                                                                       S 23          26
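To see the Example 3.2 linear machine in action, here is a minimal Python sketch (mine, not from the slides) that rebuilds the three discriminants from the class prototypes x1 = (10, 2), x2 = (2, −5), x3 = (−5, 5) and classifies by the largest gi(x):

```python
import numpy as np

prototypes = [np.array([10.0, 2.0]),    # class 1
              np.array([2.0, -5.0]),    # class 2
              np.array([-5.0, 5.0])]    # class 3

def linear_machine(x):
    """Minimum-distance (linear machine) classifier: argmax_i x_i^t x - 0.5 x_i^t x_i."""
    scores = [xi @ x - 0.5 * (xi @ xi) for xi in prototypes]
    return int(np.argmax(scores)) + 1

print(linear_machine(np.array([10.0, 2.0])))   # -> 1 (at the class-1 prototype)
print(linear_machine(np.array([0.0, -5.0])))   # -> 2
print(linear_machine(np.array([-5.0, 5.0])))   # -> 3
```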
Linear Machine and Minimum Distance
        Classification (cont.)
• If R linear discriminant functions exist for a set of
  patterns such that

      gi(x) > gj(x)   for x ∈ class i,
      i = 1, 2, ..., R,   j = 1, 2, ..., R,   i ≠ j

   – The classes are linearly separable




                                                          27
Linear Machine and Minimum Distance
        Classification (cont.)




                                      28
Linear Machine and Minimum Distance
            Classification (cont.)
(a) 2x1 − x2 + 2 = 0: the decision surface is a line (patterns in E2)
(b) 2x1 − x2 + 2 = 0: the decision surface is a plane (patterns in E3, with a third axis x3)
(c) x1 = [2, 5]ᵗ, x2 = [−1, −3]ᵗ
    ⇒ the decision surface for the minimum-distance classifier:
        (x1 − x2)ᵗ x + (1/2)(‖x2‖² − ‖x1‖²) = 0
        3x1 + 8x2 − 19/2 = 0
(d) [Figures: the line 2x1 − x2 + 2 = 0 through (−1, 0) and (0, 2) in the (x1, x2) plane;
     the corresponding plane in (x1, x2, x3) space; and the line 3x1 + 8x2 − 19/2 = 0,
     with axis intercepts 19/6 and 19/16]
                                                                                         29
Linear Machine and Minimum Distance
        Classification (cont.)
•   Examples 3.1 and 3.2 have shown that the
    coefficients (weights) of the linear
    discriminant functions can be determined if
    the a priori information about the sets of
    patterns and their class membership is
    known




                                                  30
Linear Machine and Minimum Distance
        Classification (cont.)
• The example of linearly non-separable patterns




                                                   31
Linear Machine and Minimum Distance
             Classification (cont.)

   [Figure: a layered network of three TLUs handling this linearly non-separable case.
    The two first-layer TLUs implement the lines x1 + x2 + 1 = 0 and −x1 − x2 + 1 = 0 in the
    pattern space; their outputs o1, o2 map the patterns to the corners (1, 1), (−1, 1),
    (1, −1), (−1, −1) of the (o1, o2) plane, where the output TLU separates them with
    o1 + o2 − 1 = 0.]
                                                                                           32
Discrete Perceptron Training Algorithm
         - Geometrical Representations
   • Examine the neural network classifiers that
     derive/train their weights based on the error-
     correction scheme




                   Class 1:   wt y > 0
 g(y) = wt y
                   Class 2:   wt y < 0
Augmented
input pattern
                Vector Representations
                in the Weight Space                      33
Discrete Perceptron Training Algorithm
    - Geometrical Representations (cont.)
   • Devise an analytic approach based on the
     geometrical representations
       – E.g. the decision surface for the training pattern y1

      ∇w(wᵗ y1) = y1      — the gradient of wᵗy1 (the direction of steepest increase)

      If y1 is in Class 1 and is misclassified (wᵗ y1 ≤ 0):    w′ = w1 + c y1
      If y1 is in Class 2 and is misclassified (wᵗ y1 ≥ 0):    w′ = w1 − c y1

      c (> 0) is the correction increment (it is two times the learning constant
      introduced before); c controls the size of the adjustment.

      [Figures: the two cases drawn in the weight space]
                                                                                34
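A minimal sketch of the fixed-increment training loop implied above (w′ = w ± c y whenever the augmented pattern y is misclassified, i.e., Δw = (c/2)[d − sgn(wᵗy)]y); the toy training set and the epoch limit are my own illustrative choices.

```python
import numpy as np

def sgn(v):
    # ties (w @ y == 0) are scored as class 1 here; the slides count them as mistakes
    return 1.0 if v >= 0 else -1.0

def train_discrete_perceptron(patterns, desired, c=1.0, epochs=20):
    """Fixed-increment rule: w <- w + (c/2) * (d - sgn(w^t y)) * y on augmented patterns y."""
    w = np.zeros(patterns.shape[1])       # any initial weight works for the fixed rule
    for _ in range(epochs):
        errors = 0
        for y, d in zip(patterns, desired):
            o = sgn(w @ y)
            if o != d:
                w = w + 0.5 * c * (d - o) * y   # i.e. w +/- c*y
                errors += 1
        if errors == 0:
            break
    return w

# Illustrative, linearly separable toy set (each pattern augmented with a trailing 1)
Y = np.array([[ 2.0,  2.0, 1.0],
              [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])   # +1 -> class 1, -1 -> class 2

w = train_discrete_perceptron(Y, d)
print(w, [sgn(w @ y) for y in Y])      # all outputs match d after convergence
```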
Discrete Perceptron Training Algorithm
    - Geometrical Representations (cont.)
                       Weight adjustments for three
                       augmented training patterns y1,
                       y2, y3, shown in the weight
                       space
                               y1 ∈ C1
                               y2 ∈ C1
                               y3 ∈ C2


                      - Weights in the shaded region
                        are the solutions
                      - The three lines labeled are
                        fixed during training
Weight Space                                           35
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• More about the correction increment c
  – If it is not merely a constant, but related to the current
    training pattern
                                   How to select the correction increment based on the
                                   displacement between w1 and the corrected weight vector w′:

      p = w1ᵗ y / ‖y‖              (distance from w1 to the hyperplane wᵗy = 0 of pattern y)

      Require (w1 ± c y)ᵗ y = 0   ⇒   c = ∓ w1ᵗ y / (yᵗ y) = |w1ᵗ y| / ‖y‖²   (because c > 0)

      ⇒   c y = (w1ᵗ y / ‖y‖²) y
                                                                                36
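A tiny numerical check of this choice of c (a sketch under the assumption that λ plays the role of the ratio introduced on the next slide): with c = λ·|w1ᵗy| / ‖y‖², the corrected weight vector lands exactly on the hyperplane wᵗy = 0 of the misclassified pattern when λ = 1; a larger λ would push it past the hyperplane.

```python
import numpy as np

def dynamic_correction(w, y, in_class_1, lam=1.0):
    """Move w toward the hyperplane w^t y = 0 of pattern y by a fraction lam of its distance."""
    c = lam * abs(w @ y) / (y @ y)        # c = lam * |w1^t y| / ||y||^2
    return w + c * y if in_class_1 else w - c * y

w1 = np.array([1.0, -1.0, 0.5])           # illustrative current weights
y = np.array([1.0, 2.0, 1.0])             # misclassified augmented pattern, y in class 1

w_new = dynamic_correction(w1, y, in_class_1=True, lam=1.0)
print(w1 @ y, w_new @ y)                  # -0.5 -> 0.0: w lands on the decision hyperplane
```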
Discrete Perceptron Training Algorithm
    - Geometrical Representations (cont.)
• For fixed correction rule with c=constant, the
  correction of weights is always the same fixed
  portion of the current training vector
   – The weight can be initialized at any value
              w′ = w ± c y     or equivalently     w′ = w + Δw,   Δw = (c/2)[d − sgn(wᵗ y)] y

• For the dynamic correction rule, with c dependent
  on the distance from the weight (i.e., the weight
  vector) to the decision surface in the weight
  space:
                                               ⇒   c y = (w1ᵗ y / ‖y‖²) y
   – The initial weight should be different from 0
                                                                    37
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Dynamic correction rule, with c dependent
  on the distance from the weight vector to the
  pattern’s hyperplane:

                                        c = λ · w1ᵗ y / ‖y‖²

                                        c y = λ · (w1ᵗ y / ‖y‖) · (y / ‖y‖)
                                                          38
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Example 3.3


                            1          − 0.5 y 1 ∈ C 1
                       y1 =       y2 =      
                            1           1 y2 ∈ C 2

                            3            2 y     3   ∈ C   1
                       y3 =        y4 =  
                            1           − 1  y   4   ∈ C   2



                       ∆w k =
                                   c
                                   2
                                     [         (          )]
                                     d k − sgn w kt y j y j

                         What if w kt y j = 0 ?
                         -> interpreted as a mistake
                         and followed by a correlation
                                                                   39
Continuous Perceptron
             Training Algorithm
• Replace the TLU (Threshold Logic Unit) with the
  sigmoid activation function for two reasons:
  – Gain finer control over the training procedure
  – Facilitate the differential characteristics to enable
    computation of the error gradient



                                        ŵ = w − η ∇E(w)
                                        (η: learning constant,  ∇E(w): error gradient)




                                                                            40
Continuous Perceptron
         Training Algorithm (cont.)
• The new weights are obtained by moving in the
  direction of the negative gradient along the
  multidimensional error surface




                                                 41
Continuous Perceptron
            Training Algorithm (cont.)
• Define the error as the squared difference
  between the desired output and the actual
  output:

      E = (1/2)(d − o)²
      or   E = (1/2)[d − f(wᵗ y)]² = (1/2)[d − f(net)]²

      ∇E(w) = (1/2) ∇[d − f(net)]²

      ∇E(w) = [∂E/∂w1, ∂E/∂w2, ..., ∂E/∂w_{n+1}]ᵗ
            = −(d − o) f′(net) [∂(net)/∂w1, ..., ∂(net)/∂w_{n+1}]ᵗ
            = −(d − o) f′(net) y
                                                                                                42
Continuous Perceptron
                   Training Algorithm (cont.)
• Bipolar Continuous Activation Function
      f(net) = 2 / (1 + exp(−λ·net)) − 1

      f′(net) = 2λ·exp(−λ·net) / [1 + exp(−λ·net)]² = (λ/2)·{1 − [f(net)]²} = (λ/2)(1 − o²)

      ŵ = w + (1/2)·η·λ·(d − o)(1 − o²) y

• Unipolar Continuous Activation Function

      f(net) = 1 / (1 + exp(−λ·net))

      f′(net) = λ·exp(−λ·net) / [1 + exp(−λ·net)]² = λ·f(net)[1 − f(net)] = λ·o(1 − o)

      ŵ = w + η·λ·(d − o)·o·(1 − o) y
                                                                                                                  43
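A minimal sketch of one delta-rule step for the bipolar continuous perceptron, using the update ŵ = w + ½ηλ(d − o)(1 − o²)y given above; the pattern, target, and constants are illustrative.

```python
import numpy as np

def bipolar_sigmoid(net, lam=1.0):
    """Bipolar continuous activation: f(net) = 2 / (1 + exp(-lam*net)) - 1."""
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

def delta_rule_step(w, y, d, eta=0.1, lam=1.0):
    """One continuous-perceptron update: w <- w + 0.5*eta*lam*(d - o)*(1 - o**2)*y."""
    o = bipolar_sigmoid(w @ y, lam)
    return w + 0.5 * eta * lam * (d - o) * (1.0 - o ** 2) * y, o

# Illustrative augmented pattern with desired response d = +1
w = np.array([0.5, -1.0, 0.0])
y = np.array([1.0, 1.0, 1.0])

for _ in range(5):
    w, o = delta_rule_step(w, y, d=1.0)

# each step strictly increases w @ y, so the output rises from the initial -0.24 toward d = +1
print(bipolar_sigmoid(w @ y))
```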
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.3:   f(net) = 2 / (1 + exp(−net)) − 1       (λ = 1)

      y1 = [1, 1]ᵗ,    y2 = [−0.5, 1]ᵗ,    y3 = [3, 1]ᵗ,    y4 = [2, −1]ᵗ
                                                              44
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.3   Total error surface   Trajectories started from four
                                         arbitrary initial weights




                                                                       45
Continuous Perceptron
         Training Algorithm (cont.)
• Treat the last fixed component of input pattern
  vector as the neuron activation threshold




                                                    46
Continuous Perceptron
         Training Algorithm (cont.)
• R-category linear classifier using R discrete
  bipolar perceptrons
  – Goal: the i-th TLU responds with +1 to indicate class i,
    and all other TLUs respond with −1

                                       ŵi = wi + (1/2)·c·(di − oi) y

                                       di = 1,   dj = −1   for j = 1, 2, ..., R,  j ≠ i

                                         (a “local representation” of the class labels)




                                                                           47
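A sketch (with made-up toy patterns) of this R-category scheme: one discrete bipolar TLU per class, local targets di = +1 for the true class and −1 otherwise, each TLU trained with its own fixed-correction rule.

```python
import numpy as np

def sgn(v):
    return 1.0 if v >= 0 else -1.0

def train_r_category(Y, labels, R, c=1.0, epochs=50):
    """One weight vector per class; w_i <- w_i + (c/2)*(d_i - o_i)*y with local targets."""
    W = np.zeros((R, Y.shape[1]))
    for _ in range(epochs):
        for y, lab in zip(Y, labels):
            d = -np.ones(R)
            d[lab] = 1.0                      # local representation of the label
            for i in range(R):
                o = sgn(W[i] @ y)
                W[i] += 0.5 * c * (d[i] - o) * y   # no change when o already equals d[i]
    return W

# Three illustrative augmented patterns, one per class
Y = np.array([[ 2.0,  0.0, 1.0],
              [-1.0,  2.0, 1.0],
              [-1.0, -2.0, 1.0]])
labels = [0, 1, 2]

W = train_r_category(Y, labels, R=3)
print([int(np.argmax(W @ y)) for y in Y])   # expected [0, 1, 2] once each TLU separates its class
```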
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.5




                                     48
Continuous Perceptron
         Training Algorithm (cont.)
• R-category linear classifier using R continuous
  bipolar perceptrons


                            ŵi = wi + (1/2)·η·λ·(di − oi)(1 − oi²) y,     for i = 1, 2, ..., R

                            di = 1,   dj = −1   for j = 1, 2, ..., R,  j ≠ i




                                                                            49
Continuous Perceptron
         Training Algorithm (cont.)
• Error function dependent on the difference
  vector d-o




                                               50
Bayes’ Classifier vs. Perceptron

• The perceptron operates on the premise that the patterns to
  be classified are linearly separable (otherwise the training
  algorithm will oscillate), while the Bayes’ classifier assumes
  that the (Gaussian) distributions of the two classes do
  overlap each other
• The perceptron is nonparametric, while the Bayes’
  classifier is parametric (its derivation is contingent on the
  assumption of the underlying distributions)
• The perceptron is simple and adaptive and needs little
  storage, while the Bayes’ classifier could be made
  adaptive, but at the expense of increased storage and
  more complex computations

                                                                  51
Homework

• P3.5, P3.7, P3.9, P3.22




                             52

More Related Content

What's hot

05210401 P R O B A B I L I T Y T H E O R Y A N D S T O C H A S T I C P R...
05210401  P R O B A B I L I T Y  T H E O R Y  A N D  S T O C H A S T I C  P R...05210401  P R O B A B I L I T Y  T H E O R Y  A N D  S T O C H A S T I C  P R...
05210401 P R O B A B I L I T Y T H E O R Y A N D S T O C H A S T I C P R...guestd436758
 
Structured regression for efficient object detection
Structured regression for efficient object detectionStructured regression for efficient object detection
Structured regression for efficient object detectionzukun
 
Machine learning of structured outputs
Machine learning of structured outputsMachine learning of structured outputs
Machine learning of structured outputszukun
 
Sisteme de ecuatii
Sisteme de ecuatiiSisteme de ecuatii
Sisteme de ecuatiiHerpy Derpy
 
Introduction to Numerical Methods for Differential Equations
Introduction to Numerical Methods for Differential EquationsIntroduction to Numerical Methods for Differential Equations
Introduction to Numerical Methods for Differential Equationsmatthew_henderson
 
Integrated exercise a_(book_2_B)_Ans
Integrated exercise a_(book_2_B)_AnsIntegrated exercise a_(book_2_B)_Ans
Integrated exercise a_(book_2_B)_Ansken1470
 
Introduction to Neural Netwoks
Introduction to Neural Netwoks Introduction to Neural Netwoks
Introduction to Neural Netwoks Abdallah Bashir
 
Bai giang ham so kha vi va vi phan cua ham nhieu bien
Bai giang ham so kha vi va vi phan cua ham nhieu bienBai giang ham so kha vi va vi phan cua ham nhieu bien
Bai giang ham so kha vi va vi phan cua ham nhieu bienNhan Nguyen
 
Signal Processing Course : Orthogonal Bases
Signal Processing Course : Orthogonal BasesSignal Processing Course : Orthogonal Bases
Signal Processing Course : Orthogonal BasesGabriel Peyré
 

What's hot (19)

Metric Embeddings and Expanders
Metric Embeddings and ExpandersMetric Embeddings and Expanders
Metric Embeddings and Expanders
 
05210401 P R O B A B I L I T Y T H E O R Y A N D S T O C H A S T I C P R...
05210401  P R O B A B I L I T Y  T H E O R Y  A N D  S T O C H A S T I C  P R...05210401  P R O B A B I L I T Y  T H E O R Y  A N D  S T O C H A S T I C  P R...
05210401 P R O B A B I L I T Y T H E O R Y A N D S T O C H A S T I C P R...
 
Structured regression for efficient object detection
Structured regression for efficient object detectionStructured regression for efficient object detection
Structured regression for efficient object detection
 
Computer graphics
Computer graphicsComputer graphics
Computer graphics
 
Symmetrical2
Symmetrical2Symmetrical2
Symmetrical2
 
Wordproblem
WordproblemWordproblem
Wordproblem
 
Machine learning of structured outputs
Machine learning of structured outputsMachine learning of structured outputs
Machine learning of structured outputs
 
Sisteme de ecuatii
Sisteme de ecuatiiSisteme de ecuatii
Sisteme de ecuatii
 
Cross product
Cross productCross product
Cross product
 
Introduction to Numerical Methods for Differential Equations
Introduction to Numerical Methods for Differential EquationsIntroduction to Numerical Methods for Differential Equations
Introduction to Numerical Methods for Differential Equations
 
7.3
7.37.3
7.3
 
7.3
7.37.3
7.3
 
03 finding roots
03 finding roots03 finding roots
03 finding roots
 
Chapter 04
Chapter 04Chapter 04
Chapter 04
 
Integrated exercise a_(book_2_B)_Ans
Integrated exercise a_(book_2_B)_AnsIntegrated exercise a_(book_2_B)_Ans
Integrated exercise a_(book_2_B)_Ans
 
Introduction to Neural Netwoks
Introduction to Neural Netwoks Introduction to Neural Netwoks
Introduction to Neural Netwoks
 
Bai giang ham so kha vi va vi phan cua ham nhieu bien
Bai giang ham so kha vi va vi phan cua ham nhieu bienBai giang ham so kha vi va vi phan cua ham nhieu bien
Bai giang ham so kha vi va vi phan cua ham nhieu bien
 
Signal Processing Course : Orthogonal Bases
Signal Processing Course : Orthogonal BasesSignal Processing Course : Orthogonal Bases
Signal Processing Course : Orthogonal Bases
 
06 Arithmetic 1
06 Arithmetic 106 Arithmetic 1
06 Arithmetic 1
 

Viewers also liked

Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron ClassifiersArtificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron ClassifiersMohammed Bennamoun
 
Pattern Recognition - Designing a minimum distance class mean classifier
Pattern Recognition - Designing a minimum distance class mean classifierPattern Recognition - Designing a minimum distance class mean classifier
Pattern Recognition - Designing a minimum distance class mean classifierNayem Nayem
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)EdutechLearners
 
mohsin dalvi artificial neural networks presentation
mohsin dalvi   artificial neural networks presentationmohsin dalvi   artificial neural networks presentation
mohsin dalvi artificial neural networks presentationAkash Maurya
 
MPerceptron
MPerceptronMPerceptron
MPerceptronbutest
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature SelectionJames Huang
 
Space-time data workshop at IfGI
Space-time data workshop at IfGISpace-time data workshop at IfGI
Space-time data workshop at IfGITomislav Hengl
 
Pengenalan pola sederhana dg perceptron
Pengenalan pola sederhana dg perceptronPengenalan pola sederhana dg perceptron
Pengenalan pola sederhana dg perceptronArief Fatchul Huda
 
Perceptron Slides
Perceptron SlidesPerceptron Slides
Perceptron SlidesESCOM
 
Support Vector Machine
Support Vector MachineSupport Vector Machine
Support Vector MachinePutri Wikie
 
Multi Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back PropagationMulti Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back PropagationSung-ju Kim
 
Integrating R, knitr, and LaTeX via RStudio
Integrating R, knitr, and LaTeX via RStudioIntegrating R, knitr, and LaTeX via RStudio
Integrating R, knitr, and LaTeX via RStudioAaron Baggett
 
Latex Certificate
Latex CertificateLatex Certificate
Latex CertificateRakesh Jana
 
Pattern recognition for UX - 13 April 2013
Pattern recognition for UX - 13 April 2013Pattern recognition for UX - 13 April 2013
Pattern recognition for UX - 13 April 2013amelio
 

Viewers also liked (20)

Perceptron
PerceptronPerceptron
Perceptron
 
Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron ClassifiersArtificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
 
Pattern Recognition - Designing a minimum distance class mean classifier
Pattern Recognition - Designing a minimum distance class mean classifierPattern Recognition - Designing a minimum distance class mean classifier
Pattern Recognition - Designing a minimum distance class mean classifier
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)
 
mohsin dalvi artificial neural networks presentation
mohsin dalvi   artificial neural networks presentationmohsin dalvi   artificial neural networks presentation
mohsin dalvi artificial neural networks presentation
 
MPerceptron
MPerceptronMPerceptron
MPerceptron
 
Aprendizaje Redes Neuronales
Aprendizaje Redes NeuronalesAprendizaje Redes Neuronales
Aprendizaje Redes Neuronales
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection
 
Space-time data workshop at IfGI
Space-time data workshop at IfGISpace-time data workshop at IfGI
Space-time data workshop at IfGI
 
Pengenalan pola sederhana dg perceptron
Pengenalan pola sederhana dg perceptronPengenalan pola sederhana dg perceptron
Pengenalan pola sederhana dg perceptron
 
Latex crash course
Latex crash courseLatex crash course
Latex crash course
 
IIT Certificate
IIT CertificateIIT Certificate
IIT Certificate
 
Perceptron Slides
Perceptron SlidesPerceptron Slides
Perceptron Slides
 
R in latex
R in latexR in latex
R in latex
 
Support Vector Machine
Support Vector MachineSupport Vector Machine
Support Vector Machine
 
Multi Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back PropagationMulti Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back Propagation
 
Integrating R, knitr, and LaTeX via RStudio
Integrating R, knitr, and LaTeX via RStudioIntegrating R, knitr, and LaTeX via RStudio
Integrating R, knitr, and LaTeX via RStudio
 
Backpropagation algo
Backpropagation  algoBackpropagation  algo
Backpropagation algo
 
Latex Certificate
Latex CertificateLatex Certificate
Latex Certificate
 
Pattern recognition for UX - 13 April 2013
Pattern recognition for UX - 13 April 2013Pattern recognition for UX - 13 April 2013
Pattern recognition for UX - 13 April 2013
 

Similar to Ann chapter-3-single layerperceptron20021031

8-5 Adding and Subtracting Rational Expressions
8-5 Adding and Subtracting Rational Expressions8-5 Adding and Subtracting Rational Expressions
8-5 Adding and Subtracting Rational Expressionsrfrettig
 
6.3_DiscriminantFunctions for machine learning supervised learning
6.3_DiscriminantFunctions for machine learning supervised learning6.3_DiscriminantFunctions for machine learning supervised learning
6.3_DiscriminantFunctions for machine learning supervised learningMrsMargaretSavithaP
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster AnalysisSSA KPI
 
Lesson 1: Functions and their representations (slides)
Lesson 1: Functions and their representations (slides)Lesson 1: Functions and their representations (slides)
Lesson 1: Functions and their representations (slides)Matthew Leingang
 
Lesson 1: Functions and their representations (slides)
Lesson 1: Functions and their representations (slides)Lesson 1: Functions and their representations (slides)
Lesson 1: Functions and their representations (slides)Mel Anthony Pepito
 
Unit 4 Review
Unit 4 ReviewUnit 4 Review
Unit 4 Reviewrfrettig
 
Piecewise functions updated_2016
Piecewise functions updated_2016Piecewise functions updated_2016
Piecewise functions updated_2016Benjamin Madrigal
 
Functions
FunctionsFunctions
FunctionsJJkedst
 
2 1 relationsfunctions
2 1 relationsfunctions2 1 relationsfunctions
2 1 relationsfunctionsFendi Ard
 
Graphing linear relations and functions
Graphing linear relations and functionsGraphing linear relations and functions
Graphing linear relations and functionsTarun Gehlot
 
Jacob's and Vlad's D.E.V. Project - 2012
Jacob's and Vlad's D.E.V. Project - 2012Jacob's and Vlad's D.E.V. Project - 2012
Jacob's and Vlad's D.E.V. Project - 2012Jacob_Evenson
 
Calculus - 1 Functions, domain and range
Calculus - 1 Functions, domain and rangeCalculus - 1 Functions, domain and range
Calculus - 1 Functions, domain and rangeIdrisJeffreyManguera
 

Similar to Ann chapter-3-single layerperceptron20021031 (20)

8-5 Adding and Subtracting Rational Expressions
8-5 Adding and Subtracting Rational Expressions8-5 Adding and Subtracting Rational Expressions
8-5 Adding and Subtracting Rational Expressions
 
6.3_DiscriminantFunctions for machine learning supervised learning
6.3_DiscriminantFunctions for machine learning supervised learning6.3_DiscriminantFunctions for machine learning supervised learning
6.3_DiscriminantFunctions for machine learning supervised learning
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Lesson 1: Functions and their representations (slides)
Lesson 1: Functions and their representations (slides)Lesson 1: Functions and their representations (slides)
Lesson 1: Functions and their representations (slides)
 
Lesson 1: Functions and their representations (slides)
Lesson 1: Functions and their representations (slides)Lesson 1: Functions and their representations (slides)
Lesson 1: Functions and their representations (slides)
 
Unit 4 Review
Unit 4 ReviewUnit 4 Review
Unit 4 Review
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Pr1
Pr1Pr1
Pr1
 
Piecewise functions updated_2016
Piecewise functions updated_2016Piecewise functions updated_2016
Piecewise functions updated_2016
 
Generalized Reinforcement Learning
Generalized Reinforcement LearningGeneralized Reinforcement Learning
Generalized Reinforcement Learning
 
Functions
FunctionsFunctions
Functions
 
1010n3a
1010n3a1010n3a
1010n3a
 
Gz3113501354
Gz3113501354Gz3113501354
Gz3113501354
 
Gz3113501354
Gz3113501354Gz3113501354
Gz3113501354
 
Graph of functions
Graph of functionsGraph of functions
Graph of functions
 
2 1 relationsfunctions
2 1 relationsfunctions2 1 relationsfunctions
2 1 relationsfunctions
 
Graphing linear relations and functions
Graphing linear relations and functionsGraphing linear relations and functions
Graphing linear relations and functions
 
Jacob's and Vlad's D.E.V. Project - 2012
Jacob's and Vlad's D.E.V. Project - 2012Jacob's and Vlad's D.E.V. Project - 2012
Jacob's and Vlad's D.E.V. Project - 2012
 
Exponential functions
Exponential functionsExponential functions
Exponential functions
 
Calculus - 1 Functions, domain and range
Calculus - 1 Functions, domain and rangeCalculus - 1 Functions, domain and range
Calculus - 1 Functions, domain and range
 

Ann chapter-3-single layerperceptron20021031

  • 1. Single-Layer Perceptron Classifiers Berlin Chen, 2002
  • 2. Outline • Foundations of trainable decision-making networks to be formulated – Input space to output space (classification space) • Focus on the classification of linearly separable classes of patterns – Linear discriminating functions and simple correction function – Continuous error function minimization • Explanation and justification of perceptron and delta training rules 2
  • 3. Classification Model, Features, and Decision Regions • A pattern is the quantitative description of an object, event, or phenomenon – Spatial patterns: weather maps, fingerprints … – Temporal patterns: speech signals … • Pattern classification/recognition – Assign the input data (a physical object, event, or phenomenon) to one of the pre-specified classes (categories) – Discriminate the input data within object population via the search for invariant attributes among members of the population 3
  • 4. Classification Model, Features, and Decision Regions (cont.) • The block diagram of the recognition and classification system Dimension reduction A neural network for classification and for feature extraction 4
  • 5. Classification Model, Features, and Decision Regions (cont.) • More about Feature Extraction – The compressed data from the input patterns while poses salient information – E.g. • Speech vowel sounds analyzed in 16-channel filterbanks can provide 16 spectral vectors, which can be further transformed into two dimensions – Tone height (high-low) and retraction (front-back) • Input patterns to be projected and reduced to lower dimensions 5
  • 6. Classification Model, Features, and Decision Regions (cont.) • More about Feature Extraction y x’ y’ x 6
  • 7. Classification Model, Features, and Decision Regions (cont.) • Two simple ways to generate the pattern vectors for cases of spatial and temporal objects to be classified • A pattern classifier maps input patterns (vectors) in En space into numbers (E1) which specify the membership j = i0 ( x ), j = 1, 2,..., R 7
  • 8. Classification Model, Features, and Decision Regions (cont.) • Classification described in geometric terms The decision surfaces here are curved lines i o ( x ) = j , for all x ∈ Χ j , j = 1, 2 ,..., R – Decision regions – Decision surfaces: generally, the decision surfaces for n- dimensional patterns may be (n-1)-dimensional hyper-surfaces 8
  • 9. Discriminant Functions • Determine the membership in a category by the classifier based on the comparison of R discriminant functions g1(x), g2(x),…, gR(x) – When x is within the region Xk if gk(x) has the largest value i0 ( x ) = k if g k ( x ) > g j ( x ) for k, j = 1, 2 ,..., R, k ≠ j g1 x1 g1(x) x1, x2,…., xp, ….,xP x2 g2 g2(x) P>>n gR(x) xn Assume the classifier gR Has been designed 9
  • 10. Discriminant Functions (cont.) • Example 3.1 Decision surface Equation: g ( x ) = g1 ( x ) − g 2 ( x ) = -2 x1 + x2 + 2 g ( x ) > 0 : class1 g ( x ) < 0 : class 2 The decision surface does not uniquely specify the discriminant functions The classifier that classifies patterns into two classes or categories is called “dichotomizer” “two” “cut” 10
  • 12. Discriminant Functions (cont.) (x-0,y+2, g1 -1)(2,-1,1)=0 Solution 1 2x-y-2+ g1 -1=0 g  x1  g1 ( x ) = [− 2 1]  + 3 g1=-2x+y+3 (x-0,y+2, g2 -1)(-2,1,1)=0  x2  -2x+y+2+ g2 -1=0  x1  g 2 ( x ) = [2 -1]  g2=2x-y-1 g=g1 -g2=0  x2  -4x+2y+4=0 -2x+y+2=0 y [2,-1,1] [-2,1,0] x (x-0,y+2, g1 -1)(2,-1,2)=0 Solution 2 [0,0,1] 2x-y-2+2g1 -2=0 (1,0,0) g1=-x+1/2y+2 (0,-2,1) (x-0,y+2, g2 -1)(-2,1,2)=0 [2,-1,0] -2x+y+2+ 2g2 -2=0 (0,-2,0) g2=x-1/2y An infinite number of g=g1 -g2=0 -2x+y+2=0 discriminant functions will yield correct classification 12
  • 13. Discriminant Functions (cont.) Multi-class Two-class g( x) = g1 ( x) − g2 ( x) g ( x ) > 0 : class 1 g ( x ) < 0 : class 2 subtraction Sign examination 13
  • 14. Discriminant Functions (cont.) The design of discriminator for this case is not straightforward. The discriminant functions may result as nonlinear functions of x1 and x2 14
  • 15. Bayes’ Decision Theory • A decision-making based on both the posterior knowledge obtained from specific observation data and prior knowledge of the categories – Prior class probabilities P(ωi ), ∀ class i – Class-conditioned probabilities P(x ωi ), ∀ class i P (x ω i )P (ω i ) P (x ω i )P (ω i ) k = arg max P (ω i x ) = arg max = arg max i i P (x ) i j =1 ( ) ∑ P x ω j P (ω j ) k = arg max P (ω i x ) = arg max P (x ω i )P (ω i ) i i 15
  • 16. Bayes’ Decision Theory (cont.) • Bayes’ decision rule designed to minimize the overall risk involved in making decision – The expected loss (conditional risk) when making decision δ i R (δ x ) = ∑ l (δ ω , x )P (ω x ), where l (δ ω , x ) =  0 , i = j i i j j i j j 1, i ≠ j = ∑ P (ω j x) j≠i = 1 - P (ω i x ) • The overall risk (Bayes’ risk) ∞ R = ∫ R (δ ( x ) x )p ( x )dx , δ ( x ) : the selected decision for a sample x −∞ – Minimize the overall risk (classification error) by computing the conditional risks and select the decision δ i for which the conditional risk R (δ i x ) is minimum, i.e., P (ω i x ) is maximum (minimum-error-rate decision rule) 16
  • 17. Bayes’ Decision Theory (cont.) • Two-class pattern classification g 1 ( x ) = P (ω 1 x ) ≅ P (x ω 1 )P (ω 1 ), g 2 (x ) = P (ω 2 x ) ≅ P (x ω 2 )P (ω 2 ) Bayes’ Classifier Likelihood ratio or log-likelihood ratio: ω1 ω1 P(x ω1 ) > P(ω2 ) > l (x ) = P (x ω 1 )P (ω 1 ) P (x ω 2 )P (ω 2 ) P(x ω2 ) < P(ω1 ) ω1 < ω2 > ω2 log l ( x ) = log P(x ω1 ) − log P(x ω2 ) log P(ω2 ) − log P(ω1 ) < ω2 Classification error: p (error ) = P ( x ∈ R1 , ω 2 ) + P ( x ∈ R 2 , ω 1 ) = P (x ∈ R1 ω 2 )P (ω 2 ) + P (x ∈ R 2 ω 1 )P (ω 1 ) = ∫R P (x ω 2 )P (ω 2 )dx + ∫R P (x ω 1 )P (ω 1 )dx 1 2 17
  • 18. Bayes’ Decision Theory (cont.) • When the environment is multivariate Gaussian, the Bayes’ classifier reduces to a linear classifier – The same form taken by the perceptron – But the linear nature of the perceptron is not contingent on the assumption of Gaussianity P (x ω ) =  1 1 exp  − ( x − µ ) Σ t −1 ( x − µ )  (2 π ) 1 n 2 Σ 2  2  Class ω 1 : E [ X ] = µ1 [ E ( X − µ1 )( X − µ1 ) = Σ t ] P (ω 1 ) = P (ω 2 ) = 1 2 Class ω 2 : E [ X ] = µ 2 [ E ( X − µ 2 )( X − µ 2 ) = Σ t ] Assumptions 18
  • 19. Bayes’ Decision Theory (cont.) • When the environment is Gaussian, the Bayes’ classifier reduces to a linear classifier (cont.) log l ( x ) = log P (x ω1 ) − log P (x ω 2 ) 1 =− ( x − µ1 )t Σ −1 ( x − µ1 ) + 1 ( x − µ2 )t Σ −1 ( x − µ2 ) 2 2 ( 1 t = ( µ1 − µ 2 ) Σ −1 x + µ 2 Σ −1 µ 2 − µ1 Σ −1 µ1 t 2 t ) = wx + b ω1 > ∴ log l ( x ) = wx + b 0 < ω2 19
  • 20. Bayes’ Decision Theory (cont.) • Multi-class pattern classification 20
  • 21. Linear Machine and Minimum Distance Classification • Find the linear-form discriminant function for two- class classification when the class prototypes are known • Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting center point of two classes 21
  • 22. Linear Machine and Minimum Distance Classification (cont.) The dichotomizer’s discriminant function g(x): x1 + x 2 ( x1 − x 2 ) t ( x − )=0 2 1 2 2 ( x1 − x 2 ) t x + ( x 2 − x1 ) = 0 2 x1 + x 2  w x Taken as  = 0 , where w n +1   1  2    w = x1 − x 2 w n +1 = 1 2 (x2 2 − x1 2 ) Augmented input pattern It is a simple minimum-distance classifier. 22
  • 23. Linear Machine and Minimum Distance Classification (cont.) • The linear-form discriminant functions for multi- class classification – There are up to R(R-1)/2 decision hyperplanes for R pairwise separable classes Some classes may not be contiguous o o o o o o Δ x o o x o o ΔΔ x x x x o oo Δ Δ Δ Δ x x x x o o x Δ Δ x Δ x x Δ Δ Δ Δ 23
  • 24. Linear Machine and Minimum Distance Classification (cont.) • Linear machine or minimum-distance classifier – Assume the class prototypes are known for all classes • Euclidean distance between input pattern x and the center of class i, xi : x − xi = ( x − xi ) ( x − xi ) t 2 • Minimizing x − xi = x t x − 2 xit x + xit xi is equal to 1 t maximizing xit x − xi xi The same for all classes 2 – Set the discriminant function for each class i to be: 1 t g i ( x ) = xit x − xi xi g i ( x ) = w it y 2 w = xi  wi   x  i gi (x ) =  , where ( ) wi , n +1   1  1 w i,n +1 = − x it x i    2 24
• 25. Linear Machine and Minimum Distance Classification (cont.)
• This approach is also called correlation classification
• A constant 1 is appended as the (n+1)-th component of the augmented input pattern y
  g_i(x) = x_i^t x − ½ x_i^t x_i = w_i^t y
25
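A minimal sketch of such a minimum-distance (correlation) classifier; the prototype vectors used here are the class centers consistent with the Example 3.2 weights on the next slide, and the helper name is invented.

    import numpy as np

    def minimum_distance_classify(x, prototypes):
        # Pick the class i maximizing g_i(x) = x_i^t x - 0.5 * x_i^t x_i (1-based index)
        scores = [p @ x - 0.5 * (p @ p) for p in prototypes]
        return int(np.argmax(scores)) + 1

    prototypes = [np.array([10.0, 2.0]), np.array([2.0, -5.0]), np.array([-5.0, 5.0])]
    print(minimum_distance_classify(np.array([6.0, 1.0]), prototypes))   # -> 1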
• 26. Linear Machine and Minimum Distance Classification (cont.)
• Example 3.2
  w_1 = [10, 2, −52]^t,  w_2 = [2, −5, −14.5]^t,  w_3 = [−5, 5, −25]^t
  g_1(x) = 10x1 + 2x2 − 52
  g_2(x) = 2x1 − 5x2 − 14.5
  g_3(x) = −5x1 + 5x2 − 25
  with g_i(x) = x_i^t x − ½ x_i^t x_i
• Decision surfaces:
  S12: 8x1 + 7x2 − 37.5 = 0
  S13: −15x1 + 3x2 + 27 = 0
  S23: −7x1 + 10x2 − 10.5 = 0
26
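A quick numerical check (not in the slides) that each decision surface S_ij is just the difference of the corresponding augmented weight vectors, g_i(x) − g_j(x) = 0:

    import numpy as np

    w1 = np.array([10.0, 2.0, -52.0])
    w2 = np.array([2.0, -5.0, -14.5])
    w3 = np.array([-5.0, 5.0, -25.0])

    print(w1 - w2)   # [  8.    7.  -37.5]  -> S12:  8x1 + 7x2 - 37.5 = 0
    print(w1 - w3)   # [ 15.   -3.  -27. ]  -> S13 (the same line as -15x1 + 3x2 + 27 = 0)
    print(w2 - w3)   # [  7.  -10.   10.5]  -> S23 (the same line as -7x1 + 10x2 - 10.5 = 0)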
• 27. Linear Machine and Minimum Distance Classification (cont.)
• If R linear discriminant functions exist for a set of patterns such that
  g_i(x) > g_j(x) for x ∈ Class i,  i = 1, 2, ..., R,  j = 1, 2, ..., R,  i ≠ j
  – then the classes are linearly separable
27
  • 28. Linear Machine and Minimum Distance Classification (cont.) 28
• 29. Linear Machine and Minimum Distance Classification (cont.)
  (a) 2x1 − x2 + 2 = 0: in two-dimensional pattern space the decision surface is a line (intercepts (−1, 0) and (0, 2))
  (b) 2x1 − x2 + 2 = 0: in three-dimensional pattern space (x1, x2, x3) the decision surface is a plane
  (c) x1 = [2, 5]^t, x2 = [−1, −3]^t ⇒ the decision surface for the minimum-distance classifier, (x1 − x2)^t x + ½ (||x2||² − ||x1||²) = 0, is 3x1 + 8x2 − 19/2 = 0
  (d) [Figure: plots of the above decision surfaces; the line in (c) crosses the x1 axis at (19/6, 0)]
29
  • 30. Linear Machine and Minimum Distance Classification (cont.) • Examples 3.1 and 3.2 have shown that the coefficients (weights) of the linear discriminant functions can be determined if the a priori information about the sets of patterns and their class membership is known 30
  • 31. Linear Machine and Minimum Distance Classification (cont.) • The example of linearly non-separable patterns 31
• 32. Linear Machine and Minimum Distance Classification (cont.)
  [Figure: a layered network of two TLUs handling the linearly non-separable patterns. TLU#1 and TLU#2 implement the decision lines −x1 − x2 + 1 = 0 and x1 + x2 + 1 = 0 in the input (x1, x2) plane; their outputs o1, o2 map the patterns into the image space, where the single line o1 + o2 − 1 = 0 separates the two classes. The mapped patterns fall on (1, 1), (−1, 1), (1, −1), and (−1, −1).]
32
• 33. Discrete Perceptron Training Algorithm - Geometrical Representations
• Examine the neural network classifiers that derive/train their weights based on the error-correction scheme
  g(y) = w^t y, with the augmented input pattern y
  Class 1: w^t y > 0
  Class 2: w^t y < 0
• Vector representations in the weight space
33
• 34. Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• Devise an analytic approach based on the geometrical representations
  – E.g., the decision surface for the training pattern y1, shown in the weight space
  – Gradient (the direction of steepest increase): ∇_w (w^t y1) = y1
  – If y1 is in Class 1 and misclassified: w′ = w1 + c y1
  – If y1 is in Class 2 and misclassified: w′ = w1 − c y1
  – c (> 0) is the correction increment (two times the learning constant introduced before); it controls the size of the adjustment
34
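A minimal sketch of this error-correction step (the function name and zero-handling are mine; the rule itself is the one above, written as ∆w = (c/2)[d − sgn(w^t y)]y so that nothing changes when the pattern is already classified correctly):

    import numpy as np

    def discrete_perceptron_step(w, y, d, c=1.0):
        # d is the desired output (+1 for Class 1, -1 for Class 2), y the augmented pattern
        o = np.sign(w @ y)
        if o == 0:                          # w^t y = 0 is treated as a mistake (see slide 39)
            o = -d
        return w + 0.5 * c * (d - o) * y    # moves by +/- c*y only on a misclassification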
• 35. Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• Weight adjustments for three augmented training patterns y1, y2, y3, shown in the weight space
  y1 ∈ C1, y2 ∈ C1, y3 ∈ C2
  – Weights in the shaded region are the solutions
  – The three labeled lines (the pattern hyperplanes) are fixed during training
35
• 36. Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• More about the correction increment c
  – If it is not merely a constant, but related to the current training pattern
  – How to select the correction increment based on the displacement between w1 and the corrected weight vector:
    the distance from w1 to the pattern hyperplane w^t y = 0 is p = |w1^t y| / ||y||
    requiring the corrected weight to lie on that hyperplane, (w1 ± c y)^t y = 0, gives
    c = ∓ w1^t y / (y^t y) = |w1^t y| / ||y||² (because c > 0)
    ⇒ c y = (|w1^t y| / ||y||²) y
36
• 37. Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• For the fixed correction rule with c = constant, the correction of weights is always the same fixed portion of the current training vector:
  w′ = w + ∆w, with w′ = w ± c y, or equivalently ∆w = (c/2) [d − sgn(w^t y)] y
  – The weights can be initialized at any value
• For the dynamic correction rule, with c dependent on the distance from the weight vector to the decision surface in the weight space:
  ⇒ c y = (|w1^t y| / ||y||²) y
  – The initial weights should be different from 0
37
• 38. Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• Dynamic correction rule with c dependent on the distance from the weight vector to the pattern hyperplane:
  c = λ |w1^t y| / ||y||²
  c y = λ (|w1^t y| / ||y||) (y / ||y||)
38
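A small sketch of this dynamic increment (the helper name and sample numbers are mine); with λ = 1 the corrected weight lands exactly on the pattern hyperplane w^t y = 0, while λ = 2 reflects it to the other side:

    import numpy as np

    def dynamic_increment(w, y, lam=1.0):
        # c = lambda * |w^t y| / ||y||^2
        return lam * abs(w @ y) / (y @ y)

    w, y = np.array([2.0, -1.0]), np.array([1.0, 1.0])
    c = dynamic_increment(w, y, lam=1.0)
    w_new = w - c * np.sign(w @ y) * y      # correction toward the hyperplane
    print(w_new @ y)                        # ~0.0: the new weight lies on w^t y = 0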
• 39. Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• Example 3.3
  y1 = [1, 1]^t ∈ C1,  y2 = [−0.5, 1]^t ∈ C2,  y3 = [3, 1]^t ∈ C1,  y4 = [−2, 1]^t ∈ C2
  ∆w^k = (c/2) [d_k − sgn((w^k)^t y_j)] y_j
  – What if (w^k)^t y_j = 0? → interpreted as a mistake and followed by a correction
39
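A short sketch (not from the slides) that runs the fixed-correction rule over the four augmented patterns of Example 3.3; the initial weight vector, c, and the epoch limit are illustrative choices.

    import numpy as np

    Y = [np.array([1.0, 1.0]), np.array([-0.5, 1.0]),
         np.array([3.0, 1.0]), np.array([-2.0, 1.0])]
    D = [1.0, -1.0, 1.0, -1.0]              # +1 for C1, -1 for C2

    w = np.array([-2.5, 1.75])              # arbitrary initial weight (illustrative)
    c = 1.0
    for _ in range(20):                     # cycle until a full pass makes no corrections
        errors = 0
        for y, d in zip(Y, D):
            o = np.sign(w @ y)
            if o == 0:
                o = -d                      # w^t y = 0 counts as a mistake
            if o != d:
                w = w + 0.5 * c * (d - o) * y
                errors += 1
        if errors == 0:
            break

    print(w, [int(np.sign(w @ y)) for y in Y])   # all four patterns correctly classified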
• 40. Continuous Perceptron Training Algorithm
• Replace the TLU (Threshold Logic Unit) with the sigmoid activation function for two reasons:
  – Gain finer control over the training procedure
  – Facilitate the differential characteristics to enable computation of the error gradient
  ŵ = w − η ∇E(w),  where η is the learning constant and ∇E(w) is the error gradient
40
• 41. Continuous Perceptron Training Algorithm (cont.)
• The new weights are obtained by moving in the direction of the negative gradient along the multidimensional error surface
41
• 42. Continuous Perceptron Training Algorithm (cont.)
• Define the error as the squared difference between the desired output and the actual output
  E = ½ (d − o)²,  or  E = ½ [d − f(w^t y)]² = ½ [d − f(net)]²
  ∇E(w) = ½ ∇([d − f(net)]²) = [∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_{n+1}]^t
        = −(d − o) f′(net) [∂(net)/∂w_1, ..., ∂(net)/∂w_{n+1}]^t = −(d − o) f′(net) y
42
• 43. Continuous Perceptron Training Algorithm (cont.)
• Bipolar continuous activation function:
  f(net) = 2 / (1 + exp(−λ·net)) − 1
  f′(net) = 2λ·exp(−λ·net) / [1 + exp(−λ·net)]² = (λ/2) (1 − [f(net)]²) = (λ/2) (1 − o²)
  ŵ = w + ½ η λ (d − o)(1 − o²) y
• Unipolar continuous activation function:
  f(net) = 1 / (1 + exp(−λ·net))
  f′(net) = λ·exp(−λ·net) / [1 + exp(−λ·net)]² = λ f(net)[1 − f(net)] = λ o (1 − o)
  ŵ = w + η λ (d − o) o (1 − o) y
43
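A minimal sketch of one delta-rule step with the bipolar activation above (the function names and toy data are mine):

    import numpy as np

    def bipolar_sigmoid(net, lam=1.0):
        return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

    def continuous_perceptron_step(w, y, d, eta=0.1, lam=1.0):
        # w' = w + 0.5 * eta * lam * (d - o) * (1 - o^2) * y
        o = bipolar_sigmoid(w @ y, lam)
        return w + 0.5 * eta * lam * (d - o) * (1.0 - o ** 2) * y

    w, y = np.array([0.5, -0.5]), np.array([1.0, 1.0])
    for _ in range(100):
        w = continuous_perceptron_step(w, y, d=1.0)
    print(bipolar_sigmoid(w @ y))           # driven toward the desired output d = 1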
• 44. Continuous Perceptron Training Algorithm (cont.)
• Example 3.3, with f(net) = 2 / (1 + exp(−net)) − 1
  y1 = [1, 1]^t,  y2 = [−0.5, 1]^t,  y3 = [3, 1]^t,  y4 = [−2, 1]^t
44
• 45. Continuous Perceptron Training Algorithm (cont.)
• Example 3.3 (cont.)
  [Figure: the total error surface and the weight trajectories started from four arbitrary initial weights]
45
  • 46. Continuous Perceptron Training Algorithm (cont.) • Treat the last fixed component of input pattern vector as the neuron activation threshold 46
• 47. Continuous Perceptron Training Algorithm (cont.)
• R-category linear classifier using R discrete bipolar perceptrons
  – Goal: the i-th TLU responds with +1 to indicate class i, and all other TLUs respond with −1
  ŵ_i = w_i + (c/2) (d_i − o_i) y
  d_i = 1,  d_j = −1 for j = 1, 2, ..., R, j ≠ i
  – For "local representation"
47
  • 48. Continuous Perceptron Training Algorithm (cont.) • Example 3.5 48
• 49. Continuous Perceptron Training Algorithm (cont.)
• R-category linear classifier using R continuous bipolar perceptrons
  ŵ_i = w_i + ½ η λ (d_i − o_i)(1 − o_i²) y,  for i = 1, 2, ..., R
  d_i = 1,  d_j = −1 for j = 1, 2, ..., R, j ≠ i
49
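A compact sketch of this R-category scheme with R continuous bipolar perceptrons and local-representation targets (the discrete variant of slide 47 would simply swap in the TLU and the ±c correction); the shapes, constants, and sample pattern below are illustrative.

    import numpy as np

    def r_category_step(W, y, k, eta=0.1, lam=1.0):
        # One step for R perceptrons (rows of W); targets: d_k = +1, d_j = -1 for j != k
        d = -np.ones(W.shape[0])
        d[k] = 1.0
        o = 2.0 / (1.0 + np.exp(-lam * (W @ y))) - 1.0            # bipolar outputs
        return W + 0.5 * eta * lam * ((d - o) * (1.0 - o ** 2))[:, None] * y

    W = np.zeros((3, 3))                    # 3 classes, 2-D patterns plus the bias component
    y = np.array([1.0, 2.0, 1.0])           # augmented pattern whose true class is k = 0
    for _ in range(200):
        W = r_category_step(W, y, k=0)
    print(int(np.argmax(W @ y)))            # -> 0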
  • 50. Continuous Perceptron Training Algorithm (cont.) • Error function dependent on the difference vector d-o 50
• 51. Bayes' Classifier vs. Perceptron
• The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes' classifier assumes that the (Gaussian) distributions of the two classes do overlap each other
• The perceptron is nonparametric, while the Bayes' classifier is parametric (its derivation is contingent on the assumption of the underlying distributions)
• The perceptron is simple and adaptive and needs little storage, while the Bayes' classifier could be made adaptive, but at the expense of increased storage and more complex computations
51
  • 52. Homework • P3.5, P3.7, P3.9, P3.22 52