The Back Propagation Learning Algorithm

1. The Back Propagation Learning Algorithm

   BP is extensively used and studied.
   Local minima.
   Learning can be slow.
   Practical examples.
   Handling time.

2. Local Minima

   Algorithms based on gradient descent can become stuck in local minima.

   [Figure: three error surfaces E plotted against a weight wi, illustrating
    local minima.]

   However, local minima do not generally tend to be a problem.
   Speed of convergence is the main problem.

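Not from the slides, but as a concrete illustration of the point above: a minimal Python sketch of plain gradient descent on a one-dimensional error surface with two minima. The error function, starting points and learning rate are arbitrary choices for the demonstration; the same update rule settles into different minima depending on where it starts.

```python
# Minimal sketch: gradient descent on E(w) = w^4 - 3w^2 + w, which has a
# global minimum near w = -1.3 and a local minimum near w = +1.1.
def dE_dw(w):
    return 4 * w**3 - 6 * w + 1      # derivative of w^4 - 3w^2 + w

def descend(w, eta=0.01, steps=2000):
    for _ in range(steps):
        w -= eta * dE_dw(w)          # standard gradient-descent update
    return w

print(descend(-2.0))   # ends near the global minimum (w ~ -1.3)
print(descend(+2.0))   # ends near the local minimum  (w ~ +1.1)
```
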
3. Learning can be Slow

   The more layers, the slower learning becomes:

      \Delta w_{ij} = \eta \underbrace{(t_i - y_i)\, y_i (1 - y_i)}_{\delta_i} v_j

      \Delta u_{jk} = \eta \underbrace{\Big( \sum_i \delta_i w_{ij} \Big)\, v_j (1 - v_j)}_{\delta_j} x_k

   Each error term \delta modifies the previous one by a y(1 - y)-like term.
   Since y is a sigmoidal function (0 < y < 1), then 0 <= y(1 - y) <= 0.25.
   The more layers, the smaller the effective errors get, and the slower the
   network learns.

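A small numerical check of the argument above (the grid of y values and the number of layers are arbitrary): the sigmoid derivative y(1 - y) never exceeds 0.25, so each additional layer can only scale the back-propagated error term down.

```python
# Minimal sketch: the factor y(1 - y) peaks at 0.25 (y = 0.5), so error terms
# shrink by at least a factor of 4 per layer in the best case.
import numpy as np

y = np.linspace(0.001, 0.999, 1000)
print((y * (1 - y)).max())           # ~0.25, the largest possible value of y(1 - y)

delta = 1.0                          # error term at the output layer
for layer in range(1, 6):
    delta *= 0.25                    # best case: each layer multiplies by at most 0.25
    print(f"upper bound after {layer} extra layer(s): {delta}")
```
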
4. Speeding up Learning

   A simple method of speeding up the learning is to add a momentum term:

      \Delta w_{ij}(t+1) = -\eta \frac{\partial E}{\partial w_{ij}} + \alpha\, \Delta w_{ij}(t)

   where 0 < \alpha < 1.

   Each weight is given some "inertia" or "momentum", so it tends to change in
   the direction of its average.

   When the weight change is the same every iteration (e.g. when travelling
   over a plateau):

      \Delta w_{ij}(t+1) = \Delta w_{ij}(t)

      (1 - \alpha)\, \Delta w_{ij}(t+1) = -\eta \frac{\partial E}{\partial w_{ij}}

      \Delta w_{ij}(t+1) = -\frac{\eta}{1 - \alpha} \frac{\partial E}{\partial w_{ij}}

   So, if \alpha = 0.9, the effective learning rate is 10\eta.

   Higher-order techniques (e.g. conjugate gradient) are faster.

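A minimal sketch of the momentum rule above on a plateau (a constant gradient is assumed purely for illustration): the weight change settles at -eta/(1 - alpha) times the gradient, i.e. an effective learning rate of 10*eta for alpha = 0.9.

```python
# Minimal sketch: Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t).
# On a plateau the step size approaches eta / (1 - alpha).
eta, alpha = 0.25, 0.9
grad = 1.0                     # pretend dE/dw is constant (a plateau)
delta_w = 0.0

for t in range(50):
    delta_w = -eta * grad + alpha * delta_w

print(delta_w)                 # approaches -eta / (1 - alpha) = -2.5
print(-eta / (1 - alpha))      # -2.5, i.e. an effective learning rate of 10 * eta
```
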
5. Encoder Networks

   Momentum = 0.9, Learning Rate = 0.25

   [Figure: training error falling from 10.0 towards 0.0 over roughly 400
    epochs, with the eight input and output patterns shown alongside.]

   8 inputs: local encoding, 1 of 8 active.
   Task: reproduce the input at the output layer ("bottleneck").

   After 400 epochs, activation of the hidden units:

      Pattern   Hidden units        Pattern   Hidden units
        1         1 1 1               5         1 0 0
        2         0 0 0               6         0 0 1
        3         1 1 0               7         0 1 0
        4         1 0 1               8         0 1 1

   Also called "self-supervised" networks.
   Related to PCA (a statistical method).
   Application: compression.
   Local vs distributed representations.

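A sketch of how such an 8-3-8 encoder might be trained with the slide's settings (learning rate 0.25, momentum 0.9). The initialisation, number of epochs and use of bias units are assumptions, not taken from the slides; the point is that the 3 hidden units end up with a compact, roughly binary code for the 8 patterns.

```python
# Minimal sketch: an 8-3-8 encoder trained by back propagation with momentum.
import numpy as np

rng = np.random.default_rng(0)
X = np.eye(8)                            # 8 patterns, local (1-of-8) encoding
T = X.copy()                             # targets: reproduce the input

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

W1 = rng.uniform(-0.5, 0.5, (8, 3)); b1 = np.zeros(3)   # input -> hidden
W2 = rng.uniform(-0.5, 0.5, (3, 8)); b2 = np.zeros(8)   # hidden -> output
dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)
db1 = np.zeros_like(b1); db2 = np.zeros_like(b2)
eta, alpha = 0.25, 0.9

for epoch in range(5000):
    V = sigmoid(X @ W1 + b1)             # hidden activations
    Y = sigmoid(V @ W2 + b2)             # output activations
    d_out = (T - Y) * Y * (1 - Y)        # output error terms
    d_hid = (d_out @ W2.T) * V * (1 - V) # back-propagated error terms
    dW2 = eta * V.T @ d_out + alpha * dW2;  db2 = eta * d_out.sum(0) + alpha * db2
    dW1 = eta * X.T @ d_hid + alpha * dW1;  db1 = eta * d_hid.sum(0) + alpha * db1
    W2 += dW2; b2 += db2; W1 += dW1; b1 += db1

print(np.round(sigmoid(X @ W1 + b1)))    # hidden-unit code for each of the 8 patterns
```
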
6. Example: NetTalk

   Sejnowski, T. & Rosenberg, C. (1986). Parallel networks that learn to
   pronounce English text. Complex Systems 1, 145–168.

   task: to convert continuous text into speech.
   input: a window of letters from English text drawn from a 1000-word
      dictionary; a 7-letter context to disambiguate "brave", "gave" vs "have".
   output: phonetic representation of speech (which can be fed into a
      synthesiser).

   [Figure: network diagram with the input window "T h i s  i s  t h e  i n p u t"
    feeding the hidden units, which output the phoneme /s/ for the centre letter.]

7. Example: NetTalk

   [Figure: architecture with 7 x 29 input units, 80 hidden units in a single
    layer, and 26 output units.]

   Input: letter encoded using 1 of 29 units (26 letters + 3 for punctuation).
   Output: distributed representation across 21 features, including vowel
      height and position in mouth; 5 features for stress.

   Performance: 90% correct on training set.
                80–87% correct on test set.

   Two small hidden layers better than one big layer.
   Babbling during learning?
   Hidden representations: vowels vs consonants?

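A sketch of the input encoding described above. The exact choice of the 3 punctuation symbols is an assumption; the slides only specify 26 + 3 units per letter and a 7-letter window.

```python
# Minimal sketch: each of the 7 letters in the window is encoded with 1 of 29
# units, giving a 7 x 29 = 203-unit input vector.
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", ".", ","]  # assumed punctuation set

def encode_window(window):
    """One-hot encode a 7-character window as a flat 203-element vector."""
    assert len(window) == 7
    x = np.zeros((7, len(ALPHABET)))
    for i, ch in enumerate(window.lower()):
        x[i, ALPHABET.index(ch)] = 1.0
    return x.ravel()

x = encode_window("this is")   # the network predicts the phoneme for the centre letter "s"
print(x.shape, int(x.sum()))   # (203,) 7
```
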
8. Example: Hand Written Zip Code Recognition

   LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, L. &
   Jackel, L. (1989). Backpropagation applied to handwritten zip code
   recognition. Neural Computation 1, 541–551.

   task: network is to learn to recognise handwritten digits taken from U.S. Mail.
   input: digitised hand written numbers.
   output: one of 10 units is to be most active - the unit that represents the
      correctly recognised numeral.

9. Example: Hand Written Zip Code Recognition

   [Figure: real input - normalised digits from the testing set.]

   Knowledge of task constrains architecture.
   "Feature detectors" useful. Implemented by weight-sharing.
   Reduces free parameters, speeds up learning.

10. Example: Hand Written Zip Code Recognition

    0 1 2 ... 9        10 output units, fully connected (310 weights)
    H3                 30 hidden units, fully connected (5790 weights)
    H2.1 ... H2.12     12 x 16 hidden units; 8 5x5 kernels from the 12 H1 sets
                       (38592 links, 2592 weights)
    H1.1 ... H1.12     12 x 64 hidden units; 12 5x5 kernels
                       (19968 links, 1068 weights)
    input              16 x 16 digitised grayscale images

    Before weight sharing: 64660 links
    After weight sharing:  9760 weights

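A simplified sketch of weight sharing, not the paper's exact connection scheme: one 5x5 kernel swept at stride 2 over a zero-padded 16x16 image (both assumptions) yields an 8x8 feature map from only 26 weights, where an unshared layer of the same size would need 64 x 26.

```python
# Minimal sketch: a single shared 5x5 "feature detector" applied across the image.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((16, 16))             # a digitised grayscale digit (placeholder)
kernel = rng.uniform(-0.5, 0.5, (5, 5))  # the shared weights
bias = 0.0

padded = np.pad(image, 2)                # zero-pad the borders (assumption)
feature_map = np.zeros((8, 8))
for i in range(8):
    for j in range(8):
        patch = padded[2 * i:2 * i + 5, 2 * j:2 * j + 5]
        feature_map[i, j] = np.tanh(np.sum(kernel * patch) + bias)

print(feature_map.shape)                              # (8, 8)
print("shared weights:", kernel.size + 1)             # 26
print("without sharing:", 8 * 8 * (kernel.size + 1))  # 1664
```
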
11. Example: Hand Written Zip Code Recognition

    Performance:

    [Figure: error rate (%) on the test set and training set against training
     passes.]

    Hidden units developed spatial filters (centre-surround).
    Better than an earlier study which used specialised hand-crafted features
    (Denker et al, 1989).

12. Handling temporal sequences

    "Spatialise" time (e.g. NetTalk).
    Add context units with fixed connections; some trace over time.
    Standard b.p. can be used in these cases (fig. 7.5 of HKP).
    For fully recurrent networks, b.p. is extended to Real-Time Recurrent
    Learning (Williams & Zipser, 1989).

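One way to picture the context-unit idea (an assumed Elman/Jordan-style variant, not necessarily the exact scheme of fig. 7.5 of HKP): the context units hold a decaying trace of earlier inputs through fixed, untrained connections, so only the feedforward weights need standard back propagation.

```python
# Minimal sketch: context units carry a fixed exponential trace of past inputs;
# the trainable part remains an ordinary feedforward layer.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_ctx, n_hid = 4, 4, 5
rng = np.random.default_rng(0)
W_in = rng.uniform(-0.5, 0.5, (n_in + n_ctx, n_hid))  # trainable weights
mu = 0.5                                              # fixed trace decay (assumption)

context = np.zeros(n_ctx)
for x in rng.random((10, n_in)):                      # a dummy input sequence
    h = sigmoid(np.concatenate([x, context]) @ W_in)  # hidden activations
    context = mu * context + (1 - mu) * x             # fixed-connection trace of the input
    # (forward pass only; the usual backprop step would update W_in here)
```
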
13. Summary

    Back propagation is a popular training method.
    Hidden units find useful internal representations.
    Extendable to temporal sequences.
    Problems: can be slow, no convergence theorem. Need to try different
       architectures (#layers) and learning rates.
    Biological plausibility?
       1. Who provides the targets?
       2. Can signals (errors) backpropagate from one cell to another?
