Trade-off between recognition and reconstruction: Application of Robotics Vision to Face Recognition
TEL-AVIV UNIVERSITY
The Iby and Aladar Fleischman Faculty of Engineering

TRADE-OFF BETWEEN RECOGNITION AND RECONSTRUCTION: APPLICATION OF NEURAL NETWORKS TO ROBOTIC VISION

Thesis submitted for the degree "Doctor of Philosophy"

by INNA STAINVAS

Submitted to the Senate of Tel-Aviv University
1999
This work was carried out under the supervision of Doctor Nathan Intrator and Doctor Amiram Moshaiov.
This work is dedicated to my family.
Acknowledgment

I would like to thank my husband, daughter and parents for their tolerance and moral support during the completion of this thesis.

I am greatly indebted to my first advisor, Dr. Amiram Moshaiov, who gave me a chance to start as a Ph.D. student at the Engineering Faculty of Tel-Aviv University when I had been in Israel for only two months. I am very grateful to him for proposing that I work in Neural Networks and Computer Vision and for allowing me freedom in my research.

I have been pleasantly surprised by the flexibility of the educational system of Tel-Aviv University in allowing me to attend and participate in courses at different faculties, such as the Engineering Faculty, Computer Science and Foreign Languages.

While taking courses in Neural Networks, I met Dr. Nathan Intrator, who became my main supervisor and collaborator for more than five years. He opened to me a new world of Neural Networks, and I have learned much from him, not only about technical aspects but also about scientific research methodologies. Without him, this thesis would never have appeared. I am grateful to him for his tolerance, endless support and guidance.

It is impossible to thank all the people who helped me, but I would like to mention the system administrator of the Engineering Faculty, Udi Mottelo; the department secretary, Ariella Regev; the secretary of the Emigration Support department, Ahuva; my friends; and the people of the Neural Computation Group of the Computer Science Faculty: Yair Shimshoni, Nurit Vatnick and Natalie Japkowich.

This work was supported by grants from the Rich Foundation, the Don and Sara Marejn Scholarship Fund and by a grant from the Ministry of Science to Dr. Nathan Intrator.

Inna Stainvas
March 8, 1999
Abstract

Autonomous and efficient action of robots requires a robust robot vision system that can cope with variable light and view conditions. These include partial occlusion, blur, and mainly a large difference in object scale due to variable distance to the objects. This change in scale leads to reduced resolution for objects seen from a distance. One of the most important tasks for the robot's visual system is object recognition. This task is also affected by orientation and background changes. These real-world conditions require the development of specific object recognition methods.

This work is devoted to robotic object recognition. We develop recognition methods based on training that incorporates prior knowledge about the problem. The prior knowledge is incorporated via learning constraints during training (parameter estimation). A significant part of the work is devoted to the study of reconstruction constraints. In general, there is a tradeoff between the prior-knowledge constraints and the constraints emerging from the classification or regression task at hand. To avoid the additional estimation of the optimal tradeoff between these two constraints, we consider this tradeoff as a hyperparameter (under a Bayesian framework) and integrate over a certain (discrete) distribution. We also study various constraints resulting from information-theoretic considerations.

Experimental results on two face data-sets are presented. Significant improvement in face recognition is achieved for various image degradations, such as various forms of image blur, partial occlusion, and noise. Additional improvement in recognition performance is achieved when preprocessing the degraded images via state-of-the-art image restoration techniques.
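The integration over the tradeoff hyperparameter described above can be illustrated with a minimal numeric sketch. This is my own illustration, not code from the thesis: the discrete set of lambda values and the simple averaging are assumptions standing in for the actual distribution used in the work.

```python
def hybrid_loss(class_err, recon_err, lambdas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Average the hybrid objective (1 - lam) * classification_error +
    lam * reconstruction_error over a discrete set of tradeoff values,
    instead of estimating a single optimal lambda."""
    return sum((1.0 - lam) * class_err + lam * recon_err
               for lam in lambdas) / len(lambdas)

# With a lambda grid symmetric about 0.5, both error terms receive
# equal average weight.
loss = hybrid_loss(0.2, 0.4)  # approximately 0.3
```

The point of the averaging is that no separate validation procedure is needed to pick the tradeoff: every value in the discrete distribution contributes to the objective.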
Contents

1 Introduction
  1.1 General motivation
    1.1.1 Robotic vision
    1.1.2 Internal data representation
    1.1.3 Data compression
    1.1.4 Face recognition
  1.2 Overview of the thesis
2 Statistical formulation of the problem
  2.1 Bias-Variance error decomposition for a single predictor
  2.2 Variance control without imposing a learning bias
  2.3 Variance control by imposing a learning bias
    2.3.1 Smoothness constraints
    2.3.2 Invariance bias constraints
    2.3.3 Specific bias constraints
  2.4 Reconstruction bias constraints
  2.5 Minimum Description Length (MDL) Principle
    2.5.1 Minimum description length
  2.6 Bayesian framework
  2.7 MDL in the feed-forward NN
    2.7.1 MDL and EPP bias constraints
  2.8 Appendix to Chapter 2: Regularization problem
3 Imposing bias via reconstruction constraints
  3.1 Introduction
    3.1.1 Principal Component Analysis (PCA)
    3.1.2 Autoencoder network and MDL
    3.1.3 Reconstruction and generative models
    3.1.4 Classification via reconstruction
    3.1.5 Other applications of reconstruction
  3.2 Imposing reconstruction constraints
    3.2.1 Reconstruction as a bias imposing mechanism
    3.2.2 Hybrid classification/reconstruction network
    3.2.3 Hybrid network and MDL
    3.2.4 Hybrid network as a generative probabilistic model
    3.2.5 Hybrid Neural Network architecture
    3.2.6 Network learning rule
    3.2.7 Hybrid learning rule
4 Imposing bias via unsupervised learning constraints
  4.1 Introduction
  4.2 Information principles for sensory processing
  4.3 Mathematical background
    4.3.1 Entropy maximization (ME)
    4.3.2 Minimization of the output mutual information (MMI)
    4.3.3 Relation to Exploratory Projection Pursuit
    4.3.4 BCM
    4.3.5 Sum of entropies of the hidden units
    4.3.6 Nonlinear PCA
    4.3.7 Reconstruction issue
  4.4 Imposing unsupervised constraints
  4.5 Imposing unsupervised and reconstruction constraints
5 Real world recognition
  5.1 Introduction
    5.1.1 Face recognition
  5.2 Methodology
    5.2.1 Different architecture constraints
    5.2.2 Regularization
    5.2.3 Neural Network Ensembles
    5.2.4 Face data-sets
    5.2.5 Face normalization
    5.2.6 Learning parameters
  5.3 Type of image degradations
  5.4 Experimental results
    5.4.1 Different architecture constraints and regularization ensembles
  5.5 Saliency detection
    5.5.1 Saliency map
  5.6 Conclusions
  5.7 Appendix to Chapter 5: Hidden representation exploration
6 Blurred image recognition
  6.1 Methodology
    6.1.1 Experimental design
  6.2 Image degradation
    6.2.1 Main filters
    6.2.2 Other types of degradation
  6.3 Image restoration
    6.3.1 MSE minimization and regularization
    6.3.2 Image restoration in the frequency domain
    6.3.3 Denoising
  6.4 Results
    6.4.1 Image filtering
    6.4.2 Classification of noisy data
    6.4.3 Gaussian blur
    6.4.4 Motion blur
    6.4.5 Blind deconvolution
    6.4.6 All training schemes
  6.5 Conclusions
7 Summary and future work
  7.1 Summary
  7.2 Directions for future work
List of Figures

2.1 Supervised feed-forward network
2.2 Hybrid network with EPP constraints
3.1 Autoencoder network architecture
3.2 Eigenspaces extracted by PCA
3.3 Combined recognition/reconstruction network
3.4 Hybrid network with reconstruction and EPP constraints
3.5 Detailed architecture of the recognition/reconstruction network
4.1 Feed-forward network for independent component extraction
4.2 Pdf's graphs for a family of the exponential density functions
4.3 Exploratory projection pursuit network
5.1 Misclassification rate time evolution
5.2 MSE (mean-squared) recognition error time evolution
5.3 Classification based regularization
5.4 "Caricature" faces in three resolutions
5.5 Image degradation and reconstruction (TAU data-set)
5.6 Summary of different networks and different image degradations
5.7 Saliency map construction
5.8 Hidden unit activities vs. classes - for an unconstrained network
5.9 Hidden unit activities vs. classes - for a reconstruction network
5.10 Pdf's of the hidden unit activities
5.11 Hidden weight representation
6.1 Experimental design schemes
6.2 Training scheme C
6.3 Degraded images
6.4 Noisy images
6.5 Gaussian blur and restoration
6.6 Motion blur and deblur
6.7 Blind deconvolution
6.8 Recognition of blurred images via schemes A–C
6.9 Reconstruction of Gaussian blurred images
List of Tables

4.1  Unsupervised constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1  Classification results for Pentland data-set . . . . . . . . . . . . . . . . . . 85
5.2  Different ensemble types (Pentland data-set) . . . . . . . . . . . . . . . . . 87
5.3  Different ensemble types (TAU data-set) . . . . . . . . . . . . . . . . . . . 88
5.4  Recognition using saliency map (Pentland data-set) . . . . . . . . . . . . . 92
5.5  Recognition using saliency map (TAU data-set) . . . . . . . . . . . . . . . 93
6.1  Classification results for filtered data . . . . . . . . . . . . . . . . . . . . . 112
6.2  Noise and restoration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3  Gaussian blur and restoration . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4  Motion blur and restoration . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.5  Blind deconvolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.6  Blurred image recognition via joined ensembles . . . . . . . . . . . . . . . . 121
7.1  Classification error for reconstructed images . . . . . . . . . . . . . . . . . 127
Chapter 1

Introduction

1.1 General motivation

1.1.1 Robotic vision

Nowadays, robots that can move and operate autonomously in the real world are in high demand. One of the main perception tasks that has to be addressed in this context is recognition. Recognition in a real-world environment is challenging, as it has to cope with data variability such as orientation, changing background, partial occlusion and blur.

As an illustration, consider a vision-guided robot helicopter that has to navigate autonomously using only on-board sensors and computing power (Chopper, 1997). One of the basic difficulties in recognizing images taken by the helicopter cameras during an operation is the significant difference between these images and the images the robot became acquainted with under ideal flight conditions. Usually, the images taken during operation contain a large amount of degradation caused by diverse factors, such as changing illumination, bad weather conditions, relative motion between the cameras and the object of interest in the scene, shadows and the low resolution of the cameras. Some of these factors cause images to look blurred and foggy; others lead to noise and partial occlusion. All these factors are crucial for recognition performance and require special care.

Among the possible approaches to improving the recognition of degraded images is an endeavor to recover the images using state-of-the-art restoration techniques as preprocessing before the recognition stage. This preprocessing requires estimation of the degradation process, e.g. the type and parameters of the blur operation. Another approach is to address the variability directly in the recognition system. It is well known that for a restoration process to be successful, the degradation process has to be accurately
modeled. However, in many cases exact modeling is impractical, and the restored images remain partially degraded and contain artifacts. Furthermore, restoration methods are often computationally expensive and require a-priori knowledge or human interaction. It follows that effort has to be concentrated on developing recognition methods that are more robust to image degradations.

1.1.2 Internal data representation

An important aspect of robust recognition methods is the construction of an internal data representation (feature extraction) that captures the significant structure of the data. According to D. Marr (1982), finding an internal representation is an inherent component of the vision process.

Feature based representation   Many recognition methods include grouping or perceptual organization as a first stage of visual processing. In this stage, objects are represented as models containing the essential features and the tight logical rules needed for recognition. Some methods extract “anchor points” (Ullman, 1989; Brunelli and Poggio, 1992); others consider edge segments as interesting feature elements (Bhanu and Ming, 1987; Liu and Srinath, 1984). Relatively new approaches are deformable template matching (Grenander, 1978; Brunelli and Poggio, 1993; Jain et al., 1996) and the use of generalized splines for object classification (Lai, 1994). These methods attempt to extract salient features locally in the low-level stage of visual processing, according to the subjective understanding of an investigator.
Therefore, finding an internal representation based on the extraction of object features and the relations between them may be limited.

Learning internal representations via Neural Networks   A radically alternative approach is to use all the available intensity information for finding an internal representation. Principal Component Analysis (PCA) (Fukunaga, 1990) is a non-neural-network example of this approach, in which the internal representation space is spanned by the largest eigenvectors of the data covariance matrix. These eigenvectors are macro-features extracted implicitly from the images. When fed with intensity images, Neural Networks, similarly to PCA, extract an internal representation in the space of hidden unit activities.

Processing an image as a whole is a high dimensional recognition task that leads to the curse of dimensionality (Bellman, 1961), which means that there is not enough data to robustly train a classifier in a high dimensional space. As an example, a network with a single hidden unit and input images of 60 × 60 pixels has 3600 weight parameters that have to be estimated. Thus, the main issue is finding an intrinsic low dimensional
representation of the images. As was pointed out by Geman et al. (1992), a way to avoid the curse of dimensionality in Neural Networks is to prewire the important generalizations by purposefully introducing a learning bias.

The work presented in this thesis is specifically devoted to this issue. We develop image recognition techniques using hybrid feed-forward Neural Networks obtained by introducing a learning bias. In particular, we investigate the influence of novel reconstruction learning constraints on the recognition performance of feed-forward Neural Networks. In addition, we propose other learning constraints based on information theory, and subsequently compare their efficiency with that of the reconstruction learning constraints. We demonstrate that hybrid Neural Networks are robust to real-world degradation of the input visual data, and show that their performance can be further enhanced when state-of-the-art (deblur) techniques are also incorporated.

1.1.3 Data compression

Often, the compression goal is defined as finding a compact data representation that leads to good data reconstruction. Principal Component Analysis (PCA), the Discrete Fourier Transform (DFT) and its generalization, the Wavelet Transform, and advanced best-basis representations (Coifman and Wickerhauser, 1992) are examples of compression techniques. Compression may also be realized via an autoencoder network (Cottrell et al., 1987). The autoencoder is a multi-layer perceptron (MLP) type of network whose output layer coincides with the input layer and whose hidden layer is of small size.

Recently, a novel type of autoencoder network has been proposed by Zemel (1993). The hidden layer is allowed to have a large number of hidden units, but different constraints are imposed on the developed hidden representation.
The network is simultaneously trained to accurately reconstruct the input and to find a succinct representation in the hidden layer, assuming sparse or population code formation in the autoencoder hidden layer.

When the main task is recognition, the compressed data representation has been used instead of the original (high-dimensional) data (Kirby and Sirovich, 1990; Turk and Pentland, 1991; Murase and Nayar, 1993; Bartlett et al., 1998). Recognition from this representation is faster and may have better generalization performance. It is clear, however, that such compression is task-independent and may be inappropriate for a specific recognition task (Huber, 1985; Turk and Pentland, 1993).

We seek a compact data description that is task-dependent and good for recognition. Thus, the quality of the compression scheme is judged by its generalization property. Often, a separate low-dimensional representation is created for every specific task at hand. Another strategy is to discover a hidden representation that is suitable for several
potential visual tasks (Intrator and Edelman, 1996). We show that a good task-dependent compression is obtained when the data representation is constructed not only to minimize the mean-squared recognition error, but also to maintain data fidelity and/or to extract good statistical properties. These good properties may be the independence of the hidden neurons, maximum information transfer in the hidden layer, or a multi-modal distribution of the hidden unit activities. In this case, therefore, compression is task-dependent and is assisted by a-priori knowledge.

In summary, we investigate lossy compression techniques based on two visual tasks - image recognition and reconstruction. Our goal is to find a hidden representation that optimizes recognition using hints from the reconstruction task.

1.1.4 Face recognition

The performance of the proposed recognition schemes is examined on two facial data sets. Face recognition has gained much attention in recent years due to a variety of commercial applications, such as video conferencing, security, human communication and robotics. Face recognition has recently attracted the special attention of several human-robotics groups that work intensively on the creation of personal adaptive robots to assist frail and elderly blind people, and on the creation of working mobile robots for delivery assistance (Hirukawa, 1997; Connolly, 1997).

This recognition task is a very difficult one (Chellapa et al., 1995), since it is a high dimensional classification problem leading to the “curse of dimensionality”. It is further complicated by the large variability of the facial data sets due to:

• viewpoint dependence
• nonrigidity of the faces
• variable lighting conditions
• motion

The task of face recognition is a particular case of learning in which the variability of the data describing the same class is comparable with the similarity between different classes.
Other important recognition tasks in the same category are the recognition of different kinds of tanks, ships, planes, cars, etc.
1.2 Overview of the thesis

The thesis focuses on developing Neural Network techniques that improve recognition performance. A key aspect of this work is finding data representations that lead to better generalization. We show that networks trained to recognize and reconstruct images simultaneously extract features that improve recognition. Improved performance is also achieved when networks are trained to find other statistical structures in the data. The thesis is organized as follows:

Chapter 2: Formulates the recognition task in the framework of the “bias-variance” dilemma. We show that for good generalization ability the variance portion of the generalization error has to be properly controlled. We discuss different methods to control the variance portion of the generalization error and present two main approaches: reducing the variance via ensemble averaging and introducing a learning bias. We review different types of learning bias constraints and, finally, propose reconstruction constraints as a novel type of bias constraint in the context of feed-forward networks.

Starting from Section 2.5, we discuss the relation between the “bias-variance” dilemma in statistics, the MDL principle and the Bayesian framework. We show that the introduction of a learning bias corresponds to a model-cost in the description length, which has to be minimized along with an error-cost under the MDL principle. At the same time, under the Bayesian framework, the model-cost corresponds to prior knowledge about the distributions of the weights and of the hidden representation.

Chapter 3: Introduces a hybrid feed-forward network architecture that uses reconstruction constraints as a bias-imposing mechanism for the recognition task. This network, which can be interpreted under the MDL and Bayesian frameworks, modifies the low dimensional representation by concurrently minimizing the mean squared errors (MSE) of the reconstruction and classification outputs.
In other words, it attempts to improve the quality of the hidden layer representation by imposing a feature selection useful for both tasks, classification and reconstruction. The significance of each task is controlled by a trade-off parameter λ, which is interpreted as a hyper-parameter in the Bayesian framework. Finally, this chapter presents technical details of the network architecture and its learning rule.

Chapter 4: Discusses various information theory principles as constraints for the classification task. We introduce a hybrid neural network with a hidden representation which
has some useful properties, such as independence between hidden layer neurons or maximum information transfer in the hidden layer.

Chapter 5: Discusses the face recognition task. We review different Neural Network methods used for face recognition and apply the hybrid networks introduced in Chapters 3–4. This chapter contains technical details related to face normalization and learning procedures. It is shown that the best regularized network is impractical for degraded image recognition, and that integration over different regularization parameters and different initial weights is preferable. This integration is roughly approximated by averaging over network ensembles. We consider three ensemble types: the unconstrained ensemble, which corresponds to integration over initial weights with a fixed trade-off parameter λ = 0, i.e. the hidden representation is based on the recognition task alone; the reconstruction ensemble, which corresponds to integration over different values of the trade-off parameter λ for fixed initial weights; and the joined ensemble, which corresponds to integration over both the trade-off parameter λ and the initial weights and is obtained by merging the unconstrained and reconstruction ensembles.

Classification results on degraded images, such as noisy, partially occluded and blurred images, are presented. We show that the joined ensemble is superior to the reconstruction ensemble, which in turn is superior to the unconstrained ensemble. Finally, we conclude that reconstruction constraints improve generalization, especially under image degradations. In addition, we show that via saliency maps (Baluja, 1996) reconstruction can deemphasize degraded regions of the input, thus leading to classification improvement under “Salt and Pepper” noise.

Chapter 6: Addresses recognition of blurred and noisy images. In practice, images appear blurred due to motion, weather conditions and camera defocusing.
Several methods that address recognition of blurred images are proposed: (i) expanding the training set with Gaussian-blurred images; (ii) constraining the reconstruction of blurred images to the original images during training; (iii) using state-of-the-art restoration methods as preprocessing for degraded images.

Three types of joined ensembles are considered and compared: an ensemble of networks trained on the original training data only, and ensembles trained on the training set expanded with Gaussian-blurred images and with reconstruction constraints of two types, where the first is a simple duplication of the input in the output and the second is as described in (ii) above.

It is shown that training with blurred images leads to a robust classification result
under different types of blur operations, and is more important than the restoration methods.

Chapter 7: Summarizes our research and gives some perspectives on its future development, such as:

• Testing the hybrid architecture performance on non-face data sets of similar object images, such as military, medical and astronomical images
• Ensemble interpretation
• Using a recurrent network architecture
• Weighted network ensemble averaging based on the different error types between the input and output reconstruction layers
• Using invariance constraints (tangent-prop-like, see Chapter 2) as regularization terms for different types of blur operations, for both recognition and reconstruction tasks
• Generalization of the proposed hybrid network to other types of generative (reconstruction) models constrained by the classification task
Chapter 2

Statistical formulation of the problem

Images as input to Neural Networks are very high dimensional data, with size equal to the number of pixels in the image. In this case, the number of network weight parameters is considerably larger than the size of the training set. This leads to the curse of dimensionality (Bellman, 1961), which means that there is not enough data to robustly train a classifier in a high dimensional space. Until recently, estimation in such cases sounded unrealistic, but it is now accepted that such estimation is possible if the actual dimensionality of the input data is much smaller. In other words, a true, intrinsic dimensionality reduction is possible. A simple dimensionality reduction solely via a bottleneck network architecture does not cope with the problem, since the network remains an over-parameterized model (i.e. the number of free weight parameters remains large).

It is well known that an estimation error is composed of two portions, bias and variance (Geman et al., 1992). Over-parameterized models usually have a small bias (unless they are incorrect) but a high variance, since the available data is always small compared to the number of free parameters, and this leads to a high sensitivity to noise in the training data. To robustify the estimator, the variance portion of the error has to be controlled. One way to control variance is by averaging single estimators trained on the same task. Another method controls variance by introducing a learning bias as constraints on the network architecture. Different types of smoothing constraints are widespread (Wahba, 1990; Murray and Edwards, 1993; Raviv and Intrator, 1996; Munro, 1997). However, as has been pointed out by Geman et al. (1992), to solve the bias/variance dilemma, innovative bias constraints have to be used. Introduction of these constraints into the network model leads naturally to a true dimensionality reduction (Intrator, 1999).
Below, we present the bias-variance dilemma and review methods to control the variance and bias portions of the prediction error. Then we propose to use image reconstruction as an innovative bias constraint for image classification. We proceed with a discussion of the relation between the “bias-variance” dilemma in statistics, the MDL principle and Bayesian networks.

2.1 Bias-Variance error decomposition for a single predictor

The basic objective of the estimation problem is to find a function f_D(x) = f(x; D), given a finite training set D composed of n input/output pairs, D = {(x_µ, y_µ)}_{µ=1}^n, x ∈ R^d, y ∈ R^1, drawn independently according to an unknown distribution P(x, y), which “best” approximates the “target” function y (Geman et al., 1992).

Evaluation of the performance of the estimator is usually done via the mean squared error, by taking the expectation with respect to the marginal probability P(y|x):

E(x; D) ≡ E[(y − f_D(x))^2 | x, D]
        = E[(y − E[y|x])^2 | x, D] + E[(f_D(x) − E[y|x])^2 | x, D]
          + 2 E[(y − E[y|x])(f_D(x) − E[y|x]) | x, D].        (2.1.1)

It can be seen that the third term in the sum is equal to zero, since (f_D(x) − E[y|x]) does not depend on the distribution P(y|x) and plays the role of a factor, while E[(y − E[y|x]) | x, D] is equal to zero. The first term, Var(y|x), does not depend on the predictor f and measures the variability of y given x (in a model with additive independent noise, y = f(x) + η(x), this term measures the noise variance at x). The contribution of the second term can be reduced by optimizing f; this term measures the squared distance between the estimator f_D(x) and the mean of y given x, E[y|x].

A good estimator has to generalize well to new sets drawn from the same distribution P(y, x).
A natural measure of estimator effectiveness is the average error E(x) ≡ E_D[E(x; D)] = E_D[E[(y − f_D(x))^2 | x, D]] over all possible training sets D of fixed size:

E(x) = Var(y|x) + (E_D[f_D(x)] − E[y|x])^2 + E_D[(f_D(x) − E_D[f_D(x)])^2],        (2.1.2)

where the three terms are, respectively, the intrinsic error, the squared bias b^2(f|x) and the variance var(f|x). The first term is an intrinsic error that cannot be altered. If, on average, f_D(x) is different from E[y|x], then f_D(x) is biased. As we can see, an unbiased estimator may still have a large mean squared error if the variance is large. Thus, either bias or variance can contribute to poor performance (Geman et al., 1992). When training with a fixed
training set D, reducing the bias with respect to this set may increase the variance of the estimator and contribute to poor generalization performance. This is known as the trade-off between variance and bias.

2.2 Variance control without imposing a learning bias

The variance portion of a prediction error can sometimes be reduced, without introducing bias, by ensemble averaging. An ensemble (committee) is a combination of single predictors trained on the same task. In neural networks, for example, an ensemble is a combination of individual networks that are trained separately and whose predictions are then combined. The combination is done by majority or plurality rules in classification (Hansen and Salamon, 1990), or by a weighted linear combination of predictors in regression (Meir, 1994; Naftaly et al., 1997). The plurality rule is defined as the decision agreed on by the largest number of networks. The majority rule is defined as the decision agreed on by more than half of the networks; otherwise the ensemble refuses to classify and an error is reported. The most general method to create an ensemble has been presented by Wolpert (1992). The method is called stacked generalization: a non-linear network learns how to combine the network outputs with weights that vary over the feature space.

It is well known that an ensemble is useful if its individual predictors are independent in their errors or disagree on some inputs. Thus, the main question is how to find network candidates that achieve this independence. One widespread method of creating neural network ensembles is based on the fact that neural networks are non-identifiable models, i.e. the selection of the weights is an optimization problem with many local minima. Thus, a network ensemble is created by varying the set of initial random weights (Perrone, 1993).
Another way is to use different types of predictors, like a mixture of networks with different topologies and complexities, or a mixture of networks with completely different types of learning rules (Jacobs, 1997). Yet another way is to train the networks on different training sets. Below, a bias-variance error decomposition for a weighted linear combination of predictors is presented (Raviv, 1998; Tesauro et al., 1995).

Let us consider M predictors f_i(x, D_i), each trained on a training set D_i. All training sets have the same size and are drawn from the same joint distribution P(y, x). Consider the ensemble based on the linear combination of predictors:

f_ens(x) = Σ_i a_i f_i(x, D_i),    Σ_i a_i = 1,  a_i ≥ 0,  i = 1, 2, ..., M.        (2.2.3)
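The variance reduction achieved by the linear combination in (2.2.3) can be checked numerically. Below is a small sketch under assumed conditions (unbiased, independent, identically distributed predictors with an assumed noise level), not a claim about any particular network ensemble:

```python
import numpy as np

rng = np.random.default_rng(1)

M, trials = 10, 20000
target = 2.0   # true value the predictors estimate (assumed)
sigma = 1.0    # per-predictor noise standard deviation (assumed)

# Each row holds M independent, unbiased single predictors f_i(x) = target + noise.
preds = target + rng.normal(scale=sigma, size=(trials, M))

# Uniform ensemble weights a_i = 1/M; they sum to one, keeping the ensemble unbiased.
ens = preds.mean(axis=1)

var_single = preds[:, 0].var()
var_ens = ens.var()
# For independent, identically distributed predictors: var_ens ≈ var_single / M.
```

Printing `var_single / var_ens` for this toy setting gives a ratio close to M, matching the 1/M reduction derived below for uncorrelated, unbiased estimators.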
The normalization condition Σ_i a_i = 1 is imposed to make the ensemble unbiased whenever each individual estimator f_i is unbiased. Let us consider the error (2.1.2) for this ensemble:

E_ens(x) = Var(y|x) + b^2(f_ens|x) + var(f_ens|x),        (2.2.4)

where the bias b(f_ens|x) is given by:

b(f_ens|x) = E_{D_1,...,D_M}[Σ_i a_i f_i(x, D_i) − E[y|x]]
           = Σ_i a_i E_{D_i}[f_i(x, D_i) − E[y|x]] = Σ_i a_i b(f_i|x).        (2.2.5)

Thus the bias of the ensemble is the same linear combination of the biases of the estimators. Expanding the ensemble variance term we get:

var(f_ens|x) = E_{D_1,...,D_M}[{Σ_i a_i f_i(x, D_i) − E_{D_1,...,D_M}[Σ_i a_i f_i(x, D_i)]}^2]
             = E_{D_1,...,D_M}[(Σ_i a_i (f_i(x, D_i) − E_{D_i}[f_i(x, D_i)]))^2]
             = Σ_i a_i^2 var(f_i|x)
               + 2 Σ_{i>j} a_i a_j E_{D_i,D_j}[(f_i − E_{D_i}[f_i])(f_j − E_{D_j}[f_j])].

Finally, we get the following expression for the ensemble error:

E_ens(x) = Var(y|x) + (Σ_i a_i b(f_i|x))^2 + Σ_i a_i^2 var(f_i|x)
           + 2 Σ_{i>j} a_i a_j E_{D_i,D_j}[(f_i − E_{D_i}[f_i])(f_j − E_{D_j}[f_j])].        (2.2.6)

If all estimators are unbiased, uncorrelated and have identical variances, simple averaging with the same weights a_i = 1/M leads to the following ensemble error (Raviv, 1998):

E(x) = Var(y|x) + b^2(f|x) + (1/M) var(f|x).

This decomposition shows that when the biases are small and the predictors are independent, a significant reduction of order 1/M in the variance may be attained.

If the estimators are unbiased and uncorrelated, it is easy to show that the optimal weights have to be inversely proportional to the variances of the individual predictors, a_i ∝ 1/var(f_i|x) (Tresp and Taniguchi, 1995; Taniguchi and Tresp, 1997). Intuitively this means that a predictor that is uncertain about its own prediction should obtain a smaller weight.
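The inverse-variance weighting can also be illustrated numerically. In this sketch the per-predictor noise levels are assumptions chosen for illustration; the point is only that down-weighting the uncertain predictor beats plain averaging:

```python
import numpy as np

rng = np.random.default_rng(2)

trials = 20000
target = 1.0
sigmas = np.array([0.5, 1.0, 2.0])  # per-predictor noise levels (assumed)

# Three unbiased predictors with different variances.
preds = target + rng.normal(size=(trials, 3)) * sigmas

# Weights inversely proportional to each predictor's variance, normalized to sum to 1.
w = 1.0 / sigmas**2
w /= w.sum()

ens_opt = preds @ w           # inverse-variance weighted ensemble
ens_uni = preds.mean(axis=1)  # plain averaging with a_i = 1/M

# The most uncertain predictor gets the smallest weight, so the weighted
# ensemble has a lower variance than the uniformly averaged one.
```

For these assumed noise levels the weighted ensemble's variance is roughly a third of the uniform one's, consistent with a_i ∝ 1/var(f_i|x) being optimal for uncorrelated, unbiased predictors.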
2.3 Variance control by imposing a learning bias

The regression function E[y|x] is the best estimator. In order to find an unbiased estimator, the family of possible estimators has to be abundant. In MLP (multi-layer perceptron) networks, this may be attained at the expense of growing the network architecture. This eliminates bias but increases variance unless the training data is infinite. In practice, the training data is finite, and the main question is how to make both the bias and the variance “small” using finite training sets (Geman et al., 1992). Geman et al. point out that under this limitation the learning task is to generalize in a very nontrivial sense, since the training data will never “cover” the space of possible inputs. This extrapolation is possible if the important generalizations are prewired in the learning algorithms by purposefully introducing a bias. The most general and weakest a-priori constraints assume that the mapping is smooth. Other, stronger a-priori constraints may be expressed as an invariance of the mapping under some group of transformations, or as an assumption about the class of possible mappings. Another type of specific bias constraint appears when a supervised task is learned in parallel with other related tasks.

One way to categorize different types of constraints into two groups, variance and bias constraints, has been proposed in (Intrator, 1999). Both types of constraints serve to reduce the variance portion of the generalization error; however, they have a different effect on the bias portion of the error. Variance constraints always result in an increase of the bias portion of the error. In contrast, bias constraints assist in learning and may even reduce the bias portion of the error.
When networks are trained to satisfy constraints only, the bias constraints lead to a meaningful hidden representation, capturing the structure of the input domain, while a hidden representation extracted via variance constraints is less interesting.

2.3.1 Smoothness constraints

The easiest way to smooth the mapping approximated by a neural network is by controlling network structure parameters such as the numbers of hidden units and hidden layers. The larger the number of network units, the larger the number of weight fitting parameters. Over-parameterized models are highly flexible and reduce bias. However, they are sensitive to noise, which leads to a large variance and a large generalization error. Another way to control smoothness in neural networks, borrowed from spline theory (Wahba, 1990), is to use weight decay. This involves adding a penalty term, controlling the weights' norm, to the network cost function E = Σ_i ||y_i − f(x_i, ω)||^2 (other forms of cost functions
  24. 24. Chapter 2: Statistical formulation of the problem 13are presented in (Bishop, 1995a)): 2 Eλ = E + λ ω ,where xi and yi are the suitably scaled input and output samples ( z is the normin the space of the element z). Another tightly related approach is to constrain a rangeof the weights to some middle values. The method is called weight elimination and the 2 2 2regularization term has the form λ i ωi /(ωi + ωi0 ). A direct approach is to consider aregularizer which penalizes curvature explicitly: 2 Eλ = E + λ Pf ,where P is a differential operator. Another way to control the smoothness is to inject noiseduring the learning. The noise is usually added to the training data (Bishop, 1995a; Ravivand Intrator, 1996), but may be added to the hidden units (Munro, 1997) or weights(Murray and Edwards, 1993) during learning as well. It has been shown (Bishop, 1995b)that learning with input noise is equivalent to Tikhonov (direct curvature) regularization.Though smoothness constraints bias toward smooth models, they are essentially varianceconstraints.2.3.2 Invariance bias constraintsGiven an infinite training data and unlimited training time, a network can learn theregression function. However, the data is rather limited in practice and this limitationmay be overcome by imposing bias as invariance constraints. One way to implement thisregularization is by training the system with additional data. This data is obtained bydistorting (translating, rotating, etc.) the original patterns (Baird, 1990; Baluja, 1996),while leaving the corresponding targets unchanged. This procedure, called the distortionmodel, has two drawbacks. First, the magnitude of distortion and the number of artificialdegraded patterns have to be defined. Second, the generated data is correlated withthe original training data. This type of regularization is referred to as a data drivenregularization (Raviv, 1998). 
An alternative way is to impose invariance constraints by adding a regularization term to the mean squared error E (Simard et al., 1992). The regularization term penalizes changes in the output when the input is transformed under the invariance group. Let x be an input, y = f(x, w) the input-output function of the network, and s(α, x) a transformation parameterized by some parameter α, such that s(0, x) = x. The invariance condition for every pattern x_µ is written as:

f(s(α, x_µ), w) − f(s(0, x_µ), w) = 0.        (2.3.7)
For an infinitesimal α, the latter constraint may be rewritten as:

∂f(s(α, x_µ), w)/∂α |_{α=0} = 0,   or   f_x(x_µ, w) · t_µ = 0,   t_µ = ∂s(α, x_µ)/∂α |_{α=0},        (2.3.8)

where f_x is the Jacobian (matrix) of the estimator f for a pattern x_µ, and t_µ is the tangent vector associated with the transformation s. The penalty term is written as Ω(f, w) = Σ_µ ||f_x · t_µ||^2, and the penalized function is E_λ = E + λ Ω(f, w). This regularization term states that the function f should have zero derivatives in the directions defined by the invariance group, and is called tangent prop.

Tangent prop is an infinitesimal form of the invariance “hint” proposed by Abu-Mostafa (1993). Conditions for the equivalence between adding distorted examples and a regularized cost function are presented in (Leen, 1995). In particular, it is shown that smoothing regularizers may be obtained as a special case of a random-shift invariance group, s(x, α) = x + α, where α is a Gaussian variable with a spherical covariance matrix. Obviously, non-trivial invariance constraints belong to the bias type of constraints.

2.3.3 Specific bias constraints

These constraints express a-priori heuristic knowledge about the problem. The combination of the Exploratory Projection Pursuit (EPP) method with Projection Pursuit Regression (PPR) in feed-forward neural networks (Intrator, 1993a; Intrator et al., 1996; Intrator, 1999), and the multi-task learning (MTL) method (Caruana, 1995), are examples of this type of bias constraint.

Hybrid EPP/PPR neural networks

PPR is a method to perform dimensionality reduction by approximating the desired function as a composition of lower dimensional smooth functions that act on one-dimensional linear projections of the input data (Friedman, 1987).
In other words, PPR tries to approximate the best estimator, that is, the regression function f(x) = E[Y | X = x], from observations D = {(x^µ, y^µ)}_{µ=1}^n by a sum of ridge functions g_j (functions that are constant along lines):

    f(x) ≈ Σ_{j=1}^m g_j(a_j · x),   j = 1, ..., m.    (2.3.9)

In feed-forward neural networks, the ridge functions are set in advance (as logistic
sigmoidal, for example) and the output is approximated as

    f(x) ≈ Σ_{j=1}^m β_j σ(a_j · x),   j = 1, ..., m,   x, a_j ∈ R^d,    (2.3.10)

where an input vector x is usually extended by adding an additional component equal to 1. Thus, in neural networks only the projection directions a_j and the coefficients β_j have to be estimated. However, when the input is high-dimensional, even dimensionality-reducing neural networks (m ≪ d) are over-parameterized models that require additional regularization constraints.

The already considered smoothness constraint is one way to reduce the variance of the network. Another way to impose bias constraints related to the data structure has been proposed by Intrator (Intrator, 1993a). The idea is to train a network (via the back-propagation algorithm) to fit the desired output and, simultaneously, to extract a low-dimensional structure of the data using EPP (Friedman, 1987). EPP is an unsupervised method that searches the high dimensional space for directions with good clustering properties, characterized by projection indices. An example of combining supervised with unsupervised learning using a BCM (Bienenstock, Cooper and Munro) neuron (Bienenstock et al., 1982; Intrator and Cooper, 1992) has been proposed in (Intrator, 1993b). This neuron is trained by minimizing a specific projection index that emphasizes the multi-modality in the data.

Computationally, EPP constraints are expressed as minimization of a function ρ(w) measuring the quality of the input after projection and a possible nonlinear transformation φ: ρ(w) ≡ E[H(φ(w · x))], where φ(w · x) is a hidden representation A of the network, H is a function measuring the quality of the hidden representation, and averaging takes place over an ensemble of the input.
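The projection index ρ(w) ≡ E[H(φ(w · x))] can be sketched numerically. In this illustrative sketch, φ is a tanh nonlinearity and the quality function H is an invented stand-in that simply rewards projections far from zero; the thesis uses proper projection indices such as the BCM one.

```python
import numpy as np

# A minimal sketch of the EPP projection index rho(w) = E[H(phi(w . x))]:
# the expected quality of the projected input over an ensemble.  phi and
# H below are illustrative assumptions, not the indices from the text.

def phi(z):
    return np.tanh(z)

def H(a):
    return -a ** 2          # lower is better: favors large |projection|

def rho(w, X):
    # empirical average over an ensemble of inputs
    return np.mean([H(phi(w @ x)) for x in X])

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 10))        # ensemble of 50 input vectors
w = rng.normal(size=10)
print(rho(w, X))                     # value to be minimized during learning
```

During training, the gradient of this index with respect to w would be added to the supervised error gradient, as in the modified learning rule below.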
The EPP constraints are introduced by modifying the synaptic weight learning rule:

    ∂w_{ij}/∂t = − [ ∂E(w, x)/∂w_{ij} + ∂ρ(w)/∂w_{ij} + C ],    (2.3.11)

where C is an additional complexity penalty term, such as smoothness constraints or the number of learning parameters.

Multi-task learning (MTL)

Another attractive, intuitive way to conceive different types of bias constraints is MTL. MTL is a widespread method in machine learning. It proposes to learn additional tasks, defined on the same data domain as the main task, in order to improve the generalization ability of the latter. Though the MTL idea is borrowed from the observation that humans
successfully learn many related tasks at once, it has a rigorous mathematical base. It is easy to see that the additional task learning in MTL acts as a bias-imposing mechanism that controls the balance between the bias and variance portions of the generalization error.

The MTL approach in artificial networks is realized via connectionist network architectures, in which one shared representation is used for multiple tasks. The hidden weights, connecting the input to this shared representation, are updated as a linear combination of the multi-task gradients in the back propagation of their errors. Such learning moves the shared hidden layer towards representations that better reflect the regularities of the input domain.

Though the measure of task relatedness cannot be rigorously defined, some mechanisms explaining the benefit of MTL have been suggested (Caruana, 1995; Abu-Mostafa, 1994). Nevertheless, the way to test the appropriateness of a related task as a proper bias is empirical. It is easy to see that the combination of EPP and PPR neural networks can also be considered in the MTL framework, though in MTL a related task is usually expressed more loosely and heuristically than the EPP constraints.

2.4 Reconstruction bias constraints

As shown above in Section 2.3.3, feed-forward neural networks, which require estimation of many parameters, are subject to the bias/variance dilemma. We have also seen in Sections 2.2–2.3 that different ways to control the bias and variance portions of the predictor error exist. However, when the dimensionality of the input is very high, innovative ways to reduce the variance portion of the error, as well as methods to impose (reasonable) bias, are required.

In this thesis, continuing the previous line of study, we propose a new kind of specific bias constraints for image classification feed-forward networks in the form of image reconstruction.
We also consider new information theory constraints, seeking diverse structure in the data, and compare the effect of the different constraints on the generalization performance of the classification neural network.

Below, we discuss Bayesian and minimum description length (MDL) frameworks for learning in neural networks. We show that the bias-variance dilemma can be naturally reformulated in the MDL framework, where learning constraints emerge as a model-cost that has to be minimized along with an error-cost, represented as the mean squared error (MSE) on the main learning task.
2.5 Minimum Description Length (MDL) Principle

In the MDL formulation, one searches for a model that allows the shortest data encoding, together with a description of the model itself (Rissanen, 1985). One of the first perspectives for applying the MDL principle in neural networks was pointed out by Nowlan and Hinton (1992) for supervised learning. In supervised learning, the output y is predicted from the input x, which is presented at the input layer. The network model is defined by the weight parameters. Thus, to specify the desired output y given x, the weights and the errors in the output layer have to be described. If the output errors are assumed to be Gaussian, then the number of bits needed to describe the errors is proportional to the mean-squared recognition error. The weights are encoded using different weight probability models, and their description length is the negative log of the weight probabilities. The weight description length is equivalent to different complexity terms, and the MDL principle thus leads to a regularization approach in neural networks. For example, the Gaussian probabilistic model leads to the weight decay regularization term (see Section 2.7). A more sophisticated form of weight decay is obtained when the weights are encoded as a mixture of Gaussians (Nowlan and Hinton, 1992).

Later on, the MDL principle was applied to unsupervised learning, in particular to autoencoder networks (Zemel, 1993) (see also Section 3.1.2). The autoencoder network is a feed-forward network which duplicates the observed input in the output layer. The autoencoder network has a natural interpretation in the MDL framework (Hinton and Zemel, 1994): it discovers an efficient way to communicate data to a receiver. A sender uses a set of input-to-hidden weights and, in general, non-linear activation functions to convert the input into a compact hidden representation.
This representation has to be communicated to the receiver along with the reconstruction errors and the hidden-to-top weights. Receiving the hidden-to-top weights, the receiver reconstructs the input from this abstract representation and the communicated errors. The description length in this case consists of three parts:

1. The set of activities A of the representation units. These are the codes that the net assigns to each training input sample. Encoding the activities of the representation (hidden) units avoids communication of the hidden weights and does not require knowledge of the input data X. However, the sender and the receiver have to agree on the a-priori distribution of the internal representation. This part of the message corresponds to the representation-cost.

2. The set of hidden-to-output weights W. This part of the message is represented by the weight-cost.
3. The reconstruction error, which is the disagreement between the desired and predicted outputs. This part of the message is represented by the reconstruction-cost or the error-cost. In order to evaluate the latter, the sender and receiver have to agree on the probability of the desired output of the network given its actual output.

In the standard autoencoder, the weight cost is neglected and the representation cost is considered to be small and proportional to the number of network hidden units, since it is assumed that all units participate with equal parity in the data representation. However, instead of the direct evaluation of the representation code, the autoencoder with a bottleneck in the hidden layer is trained to minimize the MSE reconstruction error. In contrast, in the nonstandard versions of autoencoders (Zemel, 1993), the representation cost is evaluated explicitly, and its minimization encourages a sparse distributed representation, in which only a few neurons are active, responsible for the presence of specific features in the patterns.

The main difference between the MDL principle for supervised learning and for the unsupervised learning proposed by Zemel may be understood by considering an unlimited number of training samples. When the number of patterns is infinite, the model cost of supervised learning, which is the cost of the weights, is negligible. In contrast, in unsupervised learning the model cost never vanishes, and the MDL is applied per sample to minimize the representation cost and to maintain data fidelity.

In this thesis, we combine supervised and unsupervised learning in the hybrid reconstruction/recognition network and formulate the MDL principle for this case (see Section 3.2.3). It turns out that this interpretation is three-fold, depending on what is defined as the main task:

1.
When the main task is reconstruction (Gluck and Myers, 1993, a hippocampus model), the reconstruction MSE is an error cost and the recognition MSE is a model cost (or a representation cost, since the MSE recognition error depends on the hidden layer representation and on the recognition top weights, which must not affect the description length). Thus, the network maintains data fidelity and encourages a representation with good discriminative properties.

2. When the main task is recognition and it is assumed that the sender observes both the input and output, while the receiver sees only the input, the recognition MSE is an error cost, as in supervised learning, and the reconstruction MSE is a model cost (or a representation cost). However, in contrast to standard supervised learning, the representation cost never vanishes.
3. When the main task is recognition, but the receiver sees neither x nor y, the receiver has to reconstruct x and predict y in parallel. Thus, the sender encodes x, taking into account also the dependence of y on x. He sends the encoded data and the errors of the recognition and reconstruction outputs, since in supervised learning the task is to predict y for the given x. In this case, both the recognition and reconstruction MSE stand for error costs, and the representation cost is restricted by the small number of hidden units.

2.5.1 Minimum description length

MDL can be formulated via an imaginary communication game, in which a sender observes the data D and communicates it to a receiver. Having observed the data, the sender discovers that the data has some regularity that can be captured by a model M. This fact encourages the sender to encode the data using a model, instead of sending the data as it is. Due to noise, there are always aspects of the data which are unpredicted by the model and can be seen as errors. Both the errors and the model have to be conveyed to the receiver to enable him to reproduce the data. The goal of the sender is to encode the data so that it can be transmitted as accurately and compactly as possible.

It is clear that complex models allow one to achieve a high accuracy, but their description is expensive. In contrast, models which are too simple or wrong are not able to extract the data regularity. Intuitively, such a communication game can be thought of as a tradeoff between the compactness of the model and its accuracy.

To transmit the data, the sender composes a message consisting of two parts. The first part of the message, with length L(M), specifies the model, and the second, with length L(D|M), describes the data D with respect to the model M.
The goal of the sender is to find a model that minimizes the length of this encoded message, L(M, D), called the description length:

    L(M, D) = L(D|M) + L(M).    (2.5.12)

According to Shannon's theory (Shannon, 1948; Cover and Thomas, 1991), to encode a random variable X with known distribution p(X) by the minimum number of bits, a realization x has to be encoded by −log p(x) bits. Thus, the description length (2.5.12) is represented as:

    L(M, D) = −log p(D|M) − log p(M),    (2.5.13)

where p(D|M) is the probability of the output data given the model, and p(M) is an a-priori model probability. The MDL principle requires searching for a model M that
minimizes the description length (2.5.13):

    M* = arg min_M ( −log p(D|M) − log p(M) ).    (2.5.14)

As we have seen in Section 2.1, in supervised learning the problem is to find a model that describes the output y as a function of the input x, based on the available input/output pairs D = {(x^µ, y^µ)}_{µ=1}^n. In a standard application of MDL to supervised learning, the output y is treated as the data D that has to be communicated between the sender and the receiver, while the input data X is assumed to be known by them. Therefore, all the probabilities in formula (2.5.13) are conditioned on the input data, i.e. p(M) ≡ p(M|X) and p(D|M) ≡ p(D|M, X). However, to simplify the notation we omit X in these expressions.

The connection between MDL and Bayesian theory for neural networks is demonstrated in the next section.

2.6 Bayesian framework

In the Bayesian framework, one seeks a model that maximizes the posterior probability of the model M given the observed input/output data (X, D):

    p(M|D, X) = p(D|M, X) p(M|X) / p(D|X).    (2.6.15)

Usually, in feed-forward networks trained by supervised learning, the distribution of the input data p(x) is not modeled¹. Thus, in (2.6.15), X always appears as a conditioning variable, which we omit to simplify the notation (similar to the convention accepted for the description length evaluation):

    p(M|D) = p(D|M) p(M) / p(D).    (2.6.16)

Since p(D) does not depend on the model, and the most plausible model M has to minimize the negative logarithm of the posterior probability, we get:

    M* = arg min_M [ −log p(D|M) − log p(M) ].    (2.6.17)

Usually, to apply both the MDL and Bayesian frameworks, one decides in advance on a class of parameterized models and then searches within this class of parameters to optimize a corresponding criterion. The probability of the data, given a model parameterized by w, can be computed by integrating over the model parameter distribution:

    p(D|M) = ∫ p(D|M, w) p(w|M) dw
    (2.6.18)

¹ In Section 3.2.3 we will consider the effect of such modelling.
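Shannon's probability-to-code-length relation used in (2.5.13) can be illustrated numerically. In this toy sketch, all specifics (the coin sequence, the 4-bit charge for describing the biased model's parameter) are invented for illustration: a mostly-heads sequence gets a shorter two-part description under a biased-coin model than under a uniform one, despite the extra model cost.

```python
import numpy as np

# Two-part description length L(M, D) = L(D|M) + L(M) for a binary
# sequence: each sample of probability p costs -log2(p) bits (Shannon).

def data_cost_bits(data, p_heads):
    p = np.where(data == 1, p_heads, 1.0 - p_heads)
    return -np.sum(np.log2(p))

data = np.array([1, 1, 1, 1, 1, 1, 1, 0,
                 1, 1, 1, 1, 1, 1, 1, 0])    # 14 heads, 2 tails

L_uniform = data_cost_bits(data, 0.5) + 0.0    # no parameter to transmit
L_biased = data_cost_bits(data, 14 / 16) + 4.0 # + 4 bits of model cost
print(L_uniform, L_biased)
```

The model that captures the data's regularity wins the tradeoff, which is exactly the minimization in (2.5.14).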
Using the Bayes formula we get:

    p(w|M, D) = p(w, D|M) / p(D|M) = p(D|M, w) p(w|M) / p(D|M),    (2.6.19)

which shows that the posterior probability of the weights p(w|M, D) is proportional to p(D|M, w) p(w|M). It is usually assumed that the posterior probability of the weights p(w|M, D) is highly peaked at the most plausible parameter w*, so the integral (2.6.18) may be approximated by the height of the peak of the integrand p(D|M, w) p(w|M), times the width ∆w|_{M,D} of this distribution (MacKay, 1992):

    p(D|M) ≈ p(D|w*, M) × p(w*|M) ∆w|_{M,D},    (2.6.20)

where the first factor is the best-fit likelihood and the remaining product is the Occam factor. The quantity ∆w|_{M,D} is the posterior uncertainty in w. Assuming that the prior p(w*|M) is uniform on some large interval ∆0w, representing the range of values of w that the model M admits before seeing the data D, p(w*|M) simplifies to p(w*|M) ≈ 1/∆0w, and

    Occam factor = ∆w / ∆0w.    (2.6.21)

Thus, the Occam factor is the ratio of the posterior accessible volume of the model parameter space to its prior accessible volume. Typically, a complex model with many parameters has a larger prior weight uncertainty ∆0w. Its Occam factor is therefore smaller, and it penalizes the complex model more strongly (MacKay, 1992).

Another interpretation of the Occam factor is obtained by viewing the model M as composed of a certain number of equivalent sub-models. When the data arrive, only one sub-model survives, and thus the Occam factor appears to be inversely proportional to the number of sub-models. Hence, −log(Occam factor) is the maximal number of bits required to describe/indicate this remaining sub-model.
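The Occam factor's penalty on complex models can be seen with made-up numbers. In this sketch, the widths and likelihoods are invented: a "complex" model fits slightly better but admits a much wider prior parameter range, so its evidence (2.6.20) is lower.

```python
import numpy as np

# Log evidence via the Occam-factor approximation (Eq. 2.6.20-2.6.21):
# log p(D|M) ~ log(best-fit likelihood) + log(posterior / prior width).
# All numeric values below are illustrative assumptions.

def log_evidence(best_fit_likelihood, posterior_width, prior_width):
    occam = posterior_width / prior_width
    return np.log(best_fit_likelihood) + np.log(occam)

simple = log_evidence(best_fit_likelihood=0.60,
                      posterior_width=0.1, prior_width=1.0)
complex_ = log_evidence(best_fit_likelihood=0.65,
                        posterior_width=0.1, prior_width=100.0)
print(simple > complex_)   # the simpler model has the larger evidence
```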
Using the Occam factor (2.6.21), the condition (2.6.17) states that the most plausible model has to minimize the description length:

    L(M, D) = −log p(D|w*, M) − log p(M) − log(Occam factor),    (2.6.22)

where the first term is the inaccuracy for the best parameters and the last two terms represent the model complexity. The first term in (2.6.22) is the ideal shortest message that encodes the data D using w* and characterizes the inaccuracy of the model prediction for the best parameters. The second term characterizes the complexity of the model. The more complex the model is, the smaller the discrepancy between the data and its prediction, but this accuracy is achieved at the expense of the model description. This relationship between model accuracy and complexity is tightly related to the bias-variance dilemma considered in
the previous section. We have seen that the introduction of many parameters leads to a better accuracy (decreases bias), but incurs high variance. Thus, MDL and the Bayesian approach offer a natural way to resolve the dilemma by seeking a model with a good generalization ability.

Another MDL interpretation of (2.6.20) is straightforward:

    L(D, M) = −log p(D|w*, M) − log p(w*|M) − log ∆w|_{M,D} − log p(M),    (2.6.23)

where the first three terms are the error-cost, the weight-cost and the precision-cost, respectively. The first term in (2.6.23) is the length of the ideal shortest message that encodes the data D using the best parameters w*. The second term is the number of bits required to encode the best model parameters. In addition, the negative logarithm of the uncertainty about the parameters after observing the data (−log ∆w|_{M,D}) penalizes models which have to be described with a high precision to fit the data. Usually, the third component is neglected, since the model parameters are communicated only once, while the data arrive one after another. A way to take the third component into consideration in neural networks, while neglecting the second term, which describes the a-priori knowledge about the model parameters, has been considered in (Hochreiter and Schmidhuber, 1997).

2.7 MDL in the feed-forward NN

A feed-forward neural network is an example of a parameterized model. It is represented graphically as a feed-forward diagram of several layers of activation units, connected by the so-called synaptic weights that represent the model parameters. The neural network architecture allows one to evaluate the output data as a function of the input data. The network is supplied with the input data, presented in the lower input layer of the network. The input is successively propagated in the forward direction via the hidden layers, using the weights and the units' activation functions, to get the output data D in the top output layer of the network.
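The forward propagation just described can be sketched for a single hidden layer network. The layer sizes, the tanh hidden nonlinearity and the linear output layer are illustrative assumptions.

```python
import numpy as np

# A sketch of forward propagation in a single hidden layer network:
# input x -> hidden weights w -> hidden representation A ->
# top weights W -> output.

def forward(x, w, W):
    A = np.tanh(w @ x)      # hidden representation
    return W @ A            # linear output layer

rng = np.random.default_rng(2)
d, h, o = 6, 4, 2           # input, hidden and output sizes (arbitrary)
w = rng.normal(size=(h, d)) # hidden weights
W = rng.normal(size=(o, h)) # top weights
x = rng.normal(size=d)
y = forward(x, w, W)
print(y.shape)
```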
The network weights, the number of hidden units and the unit activation functions are the main parameters that define the network complexity. In general, it is often assumed that the network architecture is already defined and the main problem is to find the weight parameters.

Implementing the MDL principle in neural networks is straightforward. For simplicity, we consider training a single hidden layer feed-forward neural network (Figure 2.1). Neglecting the third term in the description length (2.6.23) and assuming that the models have the same
a-priori probabilities p(M), an optimal weight vector has to minimize²:

    L(M, D) = −log p(D|w, W, M) − log p(w, W|M) + const,    (2.7.24)

where the first term is the error-cost and the second the model-cost.

[Figure 2.1: Feed-forward supervised network: input X, hidden weights w, hidden representation A, top weights W, output. A single arrow between two layers indicates that the units of both layers are fully connected.]

The first term in this expression is the error-cost of specifying the data for the given weights, i.e. the cost of specifying the errors between the true outputs and those predicted by the model with the given weights. The second term is the model-cost.

To evaluate the error-cost, the receiver and the sender have to agree on the specific form of the conditional distribution of the output t ∈ R^n. Under the assumption of independent additive Gaussian noise with zero mean in the output layer, the posterior probability of the output is given by:

    p(t|x, w, W) = (1 / C^n(λ)) exp( −(λ/2) ||t̂(x, w, W) − t||^2 ),    (2.7.25)

where C(λ) = √(2π/λ) and the parameter λ is inversely proportional to the Gaussian variance (λ = 1/σ^2). Provided the samples are drawn independently from the distributions (2.7.25), we get:

    p(D|w, W, M) = Π_{i=1}^r p(t_i | x_i, w, W),    (2.7.26)

² We have omitted the super-index for convenience.
where r is the number of training samples. The assumptions (2.7.25) and (2.7.26) produce

    p(D|w, W, M) = (1 / C^{nr}(λ)) exp( −(λ/2) E_D ),   where
    E_D = Σ_{i=1}^r ||t̂(x_i, w, W) − t_i||^2.    (2.7.27)

When the weight probability distribution is Gaussian and the hidden weights w and top weights W are independent, we get:

    p(w, W|M) = p(w|M) p(W|M),
    p(w|M) = (1 / C^{N_w}(γ_w)) exp( −(γ_w/2) ||w − m_w||^2 ),
    p(W|M) = (1 / C^{N_W}(γ_W)) exp( −(γ_W/2) ||W − m_W||^2 ),    (2.7.28)

where N_w, N_W are the numbers of hidden and top weights, the coefficients γ_w, γ_W are inversely proportional to the corresponding Gaussian variances, and m_w, m_W are the mean values of the hidden and top weights, respectively. Assumptions (2.7.25, 2.7.28) lead to the following expression for the description length (2.7.24):

    L(M, D) = (λ/2) E_D + (γ_w/2) ||w − m_w||^2 + (γ_W/2) ||W − m_W||^2
              + N_w log C(γ_w) + N_W log C(γ_W) + nr log C(λ) + const.    (2.7.29)

The first term may be recognized as an error term and the next two as a modified weight decay term. The remaining terms are constant for a chosen network architecture. Thus, the weight-decay term controls the network complexity by imposing smoothness constraints. Another form of the weight decay term has been obtained by modelling the weights as a mixture of Gaussians (Nowlan and Hinton, 1992).

There is a deep relationship linking the MDL approach and regularization techniques. The intuitive idea is that complex models can fit the training data better, but are not robust to small variations in the data. This relationship between the generalization ability of a model and its complexity is related to the bias-variance dilemma in statistics (Geman et al., 1992): over-parameterized models have high variance, while restricting the model parameters incurs a high bias in the generalization error.
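The variable part of the description length (2.7.29), an error term plus modified weight decay, can be sketched directly. The hyperparameter values and zero weight means below are illustrative assumptions.

```python
import numpy as np

# A sketch of the variable terms of Eq. (2.7.29): (lambda/2) E_D plus
# Gaussian-prior weight-decay terms on hidden (w) and top (W) weights.
# The architecture-dependent log C(.) constants are omitted.

def description_length(E_D, w, W, lam=1.0, gamma_w=0.01, gamma_W=0.01,
                       m_w=0.0, m_W=0.0):
    return (0.5 * lam * E_D
            + 0.5 * gamma_w * np.sum((w - m_w) ** 2)
            + 0.5 * gamma_W * np.sum((W - m_W) ** 2))

rng = np.random.default_rng(3)
w = rng.normal(size=(4, 6))
W = rng.normal(size=(2, 4))
L = description_length(E_D=2.5, w=w, W=W)
print(L)
```

Minimizing this quantity over the weights is exactly MSE training with (modified) weight decay.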
The MDL formulation allows one to control bias and variance in a natural way.

2.7.1 MDL and EPP bias constraints

Let us assume again that the network architecture, such as the number of hidden units and the nonlinear activation functions, is fixed. Nevertheless, does there exist another way to
control the complexity of the network? It turns out that this can be done by imposing bias constraints on the supervised neural network. A general framework for imposing EPP bias constraints in neural networks (Figure 2.2) has been considered in Section 2.3.3. We have seen that computationally these constraints are expressed as a minimization of some function H, measuring the quality of the hidden layer representation A, averaged over an ensemble of the input. In other words, EPP constraints are constraints on the specific form of the hidden representation that are known a-priori. Thus, the projection index ρ(w) is a complex function depending on the hidden weights via the hidden representation A: ρ(w) ≡ E[H(A)], where A = f(w, x) and H measures the quality of the hidden representation.

[Figure 2.2: A hybrid feed-forward network with exploratory projection pursuit (EPP) constraints: input X, hidden weights w, hidden representation A subject to bias constraints, top weights W, output. A single arrow between two layers indicates that the units of both layers are fully connected.]

This form of constraints may easily be wired into the MDL framework by assuming a particular form of the a-priori probabilities of the hidden weights:

    p(w|M) = C_H(µ) exp( −(µ/2) E[H(f(w, x))] ),    (2.7.30)

where C_H(µ) is a normalization constant. The a-priori probability p(w|M) (2.7.30) does not depend on the input x explicitly, although it does implicitly, since in the Bayesian formulation (2.6.16) all the probabilities have to be conditioned on the input data X. Assuming
independence of the hidden and top weights, we get:

    L(M, D) = (λ/2) E_D + (µ/2) E[H(A)] − log p(W|M) + const,    (2.7.31)

where the terms are the error-cost, the representation-cost and the weight-cost, respectively. The expression for the description length (2.7.31) gives a deeper level of description to the data communication and is close (though not equivalent) to Zemel's interpretation of MDL (Zemel, 1993).

In Zemel's interpretation one gets a more realistic version of the communication game, where a real communication takes place between the hidden layer with internal representation A and the top layer. The receiver requires three items in order to be able to recover the desired output:

1. The set of activities A of the representation units; these are the codes that the net assigns to each training input sample. Encoding the activities of the representation (hidden) units avoids communication of the hidden weights and does not require knowledge of the input data X. However, the sender and the receiver have to agree on the a-priori distribution of the internal representation. This part of the message corresponds to the representation-cost.

2. The set of hidden-to-output weights W. This part of the message is represented by the weight-cost.

3. The reconstruction error, which is the misfit between the desired and predicted outputs. This part of the message is represented by the reconstruction-cost or the error-cost. In order to evaluate the latter, the sender and receiver have to agree on the probability of the desired output of the network given its actual output.

Usually, the weight-cost, i.e. the number of bits required to communicate the hidden-to-top weights, is not taken into account, since it has to be communicated only once, while the representation-cost and the error-cost have to be sent for every sample. Thus, the main communication tradeoff takes place between the representation and error costs. Reducing the dimensionality of the data in the hidden layer, i.e.
compressing the data, a shorter description is obtained, but at the same time the errors are larger. The MDL principle is a tool for achieving a data representation that is both compact and accurate.

We see that, similarly to Zemel's interpretation of MDL, imposing EPP constraints leads to a description length (Eq. 2.7.31) that consists of three parts. It requires the same agreement on the probabilities of the hidden representation and the errors between the sender and receiver as described above. However, the representation cost in (Eq. 2.7.31) is taken only
once for all samples, while in Zemel's interpretation it is permanent and is assigned to each training input sample. When the number of input patterns is infinite, the representation cost induced by EPP constraints is negligible. Thus, in a manner similar to supervised learning, EPP constraints lead to a model whose model cost vanishes as the number of input patterns becomes infinite.

We postpone the consideration of the hybrid autoencoder network with reconstruction constraints and its MDL interpretation to the next section, where the reconstruction task and its applications are considered.
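The three-part cost (2.7.31) can be sketched as a single objective. The quality function H below is an invented stand-in (it rewards hidden activities near a binary ±1 code); the thesis uses projection indices such as the BCM one, and the hyperparameters are arbitrary.

```python
import numpy as np

# A sketch of the EPP-constrained description length (Eq. 2.7.31):
# (lambda/2) E_D + (mu/2) E[H(A)] - log p(W|M), the last term taken as a
# Gaussian weight-cost up to a constant.  H is an illustrative stand-in.

def H(a):
    return np.mean((a ** 2 - 1.0) ** 2)   # illustrative projection index

def epp_description_length(E_D, A_batch, W, lam=1.0, mu=0.5, gamma_W=0.01):
    rep_cost = np.mean([H(a) for a in A_batch])   # E[H(A)] over the ensemble
    weight_cost = 0.5 * gamma_W * np.sum(W ** 2)  # -log p(W|M) up to const
    return 0.5 * lam * E_D + 0.5 * mu * rep_cost + weight_cost

rng = np.random.default_rng(7)
A_batch = rng.normal(size=(20, 6))    # hidden representations of 20 samples
W = rng.normal(size=(2, 6))
L = epp_description_length(E_D=1.0, A_batch=A_batch, W=W)
print(L)
```

Note that the representation-cost here is a single ensemble average, matching the point above that it is charged once rather than per sample.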
2.8 Appendix to Chapter 2: Regularization problem

Regularization may be expressed as a minimization problem with a goal function that is a penalized cost function:

    E_λ = E + λΩ(f, w),   E = Σ_i ||y_i − f(x_i, ω)||^2.

A large value of the regularization parameter λ leads to a network with a large bias (unless the regularization term captures the underlying structure of the data), while a small value reduces bias but increases variance. The regularization task is then to find an optimal parameter λ and corresponding model parameters ω_λ providing the minimal generalization error:

    E_λ = E[ ||y − f(x, ω_λ)||^2 ].

This task is computationally very expensive.

Split-sample validation and hold-out method. The simplest way to find the regularization parameter is to use split-sample validation. This process includes the following steps for each tested value of the regularization parameter λ (the process is common for the choice of other regularization parameters, such as the number of hidden units, the choice of the early stopping moment, etc.):

• The data is randomly split into a training and a validation set. Often 2/3 of the data is used for training and 1/3 for testing.

• The training set is used for estimation of the predictor parameters by minimizing E_λ.

• The validation set is used to test the prediction error (E). The validation set must not be used in any way during training.

• The predictor with the smallest prediction error corresponds to the optimal regularization parameter λ.

The generalization error of the best predictor is in general too optimistic. The prediction error on a third, separately kept data set, called the test set, is more realistic and is often reported as the result of the predictor accuracy. This method is called the hold-out method. The disadvantage of split-sample validation and the hold-out method is that they reduce the amount of data available for both training and validation. Two methods that
overcome this drawback are cross-validation and bootstrapping (Efron and Tibshirani, 1993; Bishop, 1995a).

Cross-validation. In k-fold cross-validation, the data is divided into k subsets of (approximately) equal size. A network is trained k times, each time leaving out one of the subsets from the training set and using the omitted subset as a validation set to compute an error. If k equals the sample set size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a more elaborate and expensive version of cross-validation that involves leaving out all possible subsets of v cases. The generalization error is then measured as the average performance over all validation tests. Cross-validation is an improvement on split-sample validation.

Bootstrapping. In many cases, the bootstrap seems to be better than cross-validation (Efron and Tibshirani, 1993). In the simplest form of bootstrapping, the training data is bootstrapped, instead of repeatedly analyzing subsets of the data as in cross-validation. Given a data set of size n, a bootstrap sample is created by sampling n instances uniformly from the data with replacement. The probability of an instance remaining in the test set is then (1 − 1/n)^n ≈ e^{−1} ≈ 0.368, and of being in the training data, 0.632. Given a number b of bootstrap samples, the average performance is evaluated as a weighted sum of the training (E_i^training) and testing (E_i^testing) errors:

    E = (1/b) Σ_{i=1}^b ( 0.368 E_i^training + 0.632 E_i^testing ).    (2.8.32)

Usually the recommended number of bootstrap samples is between 200 and 2000 (Kohavi, 1995).

Cross-validation and bootstrapping require many runs, which may be computationally prohibitive, especially for the most interesting perception tasks, where the input dimensionality is very high.
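The .632 bootstrap estimate (2.8.32) can be sketched end to end. The nearest-class-mean classifier and the two-Gaussian toy data are illustrative assumptions, chosen only so the sketch is self-contained.

```python
import numpy as np

# A sketch of the .632 bootstrap (Eq. 2.8.32): for each bootstrap sample,
# fit on the in-bag instances, evaluate in-bag (training error) and
# out-of-bag (testing error), and combine with the 0.368/0.632 weights.

def class_means(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def error_rate(means, X, y):
    pred = [min(means, key=lambda c: np.linalg.norm(x - means[c])) for x in X]
    return np.mean(np.array(pred) != y)

def bootstrap_632(X, y, b=50, seed=0):
    rng = np.random.default_rng(seed)
    n, errs = len(y), []
    for _ in range(b):
        idx = rng.integers(0, n, size=n)        # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag instances
        if oob.size == 0:
            continue
        means = class_means(X[idx], y[idx])
        e_train = error_rate(means, X[idx], y[idx])
        e_test = error_rate(means, X[oob], y[oob])
        errs.append(0.368 * e_train + 0.632 * e_test)
    return float(np.mean(errs))

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
est = bootstrap_632(X, y)
print(est)
```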
Both cross-validation and bootstrapping work well for continuous error functions, such as the mean squared error, but may perform poorly for non-continuous error functions, such as the misclassification rate.
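The two resampling schemes above can be sketched as follows (an illustrative toy example, not from the thesis): a trivial mean-predictor stands in for the network so that only the resampling logic matters, and the bootstrap uses the 0.632 weighting of the out-of-bag error from Efron and Tibshirani (1993).

```python
# k-fold cross-validation and the 0.632 bootstrap estimator, sketched
# with a mean-predictor as the "model" (illustrative assumption).
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, size=60)  # data; the model predicts the training mean

def sq_err(pred, targets):
    return float(np.mean((targets - pred) ** 2))

def kfold_error(y, k=5):
    """Average validation error over k splits, each fold held out once."""
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errs.append(sq_err(np.mean(y[train]), y[val]))
    return float(np.mean(errs))

def bootstrap632_error(y, b=200):
    """0.632 bootstrap: instances left out of a bootstrap sample
    (probability ~ e^-1 ~ 0.368) form that replicate's test set."""
    n = len(y)
    errs = []
    for _ in range(b):
        idx = rng.integers(0, n, size=n)        # sample n with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag test instances
        if len(oob) == 0:
            continue
        pred = np.mean(y[idx])
        errs.append(0.632 * sq_err(pred, y[oob])
                    + 0.368 * sq_err(pred, y[idx]))
    return float(np.mean(errs))

print("k-fold CV error:", kfold_error(y))
print("0.632 bootstrap error:", bootstrap632_error(y))
```

Both estimates approximate the same generalization error; the bootstrap simply reuses the data b times instead of k times.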
Chapter 3

Imposing bias via reconstruction constraints

3.1 Introduction

Reconstruction is one of the important tasks of complex visual processing. It is a process of reproducing the input via some reasonably well-chosen model. It is commonly assumed that there is compression via a bottleneck model, and thus the input is reproduced from a reduced internal representation. The oldest and most widespread reconstruction method is Principal Component Analysis (PCA). PCA is an optimal linear compression, based on minimization of the mean squared error between the input and its reconstruction. A simple generalization of PCA to the nonlinear case is the nonlinear autoencoder. Below, we present both of these models and discuss their relationship to the MDL principle. We then proceed with a more general notion of reconstruction via a generative model and reexamine diverse applications of reconstruction models. Finally, we introduce a novel method that uses reconstruction as a bias constraint for a supervised classification task.

3.1.1 Principal Component Analysis (PCA)

PCA is widely used in multivariate analysis (Duda and Hart, 1973). PCA, also known as the Karhunen-Loève transformation (Oja, 1982; Fukunaga, 1990), is a process of mapping the original data into a more efficient representation, using an orthonormal linear transformation that minimizes the mean squared error between the data and its reconstructed version. It is well known that the optimal orthogonal basis of the data space is formed by the eigenvectors of the covariance matrix of the data. The new data representation is obtained
by projecting the data onto this new optimal basis. The eigenvectors corresponding to the largest eigenvalues are the most significant, accounting for most of the variance in the data; discarding coordinates in these directions leads to the largest error in the mean-squared sense. Therefore, when compression is performed, the coordinates corresponding to the small eigenvalues should be deleted first.

Different PCA algorithms using neural networks have been reported (see the review in Haykin, 1994). The first PCA network, proposed by Oja (1982), uses a Hebbian learning rule to find the first eigenvector, corresponding to the maximal eigenvalue. Its generalized version, the generalized Hebbian algorithm (GHA) (Sanger, 1989), extracts the first successive eigenvectors and uses feed-forward connections only. A modification of GHA, the adaptive principal component extraction (APEX) algorithm (Kung and Diamantaras, 1990), uses additional lateral connections to decorrelate the network outputs. GHA and APEX are examples of the reestimation and decorrelating types of PCA algorithms, respectively.

PCA using Hebbian networks has been considered as a first principle of perceptual processing (Miller, 1995; Atick and Redlich, 1992; Hancock et al., 1992; Field, 1994). The main goal of these studies is to explore the similarities between the PCA eigenvectors and the receptive fields of cells in the visual pathway. It may be shown (Fukunaga, 1990; Gonzalez and Wintz, 1993; Field, 1994) that for stationary and ergodic processes, PCA is approximately equivalent to the Fourier transform. Natural images, however, are not stationary, and their covariance matrix does not completely describe the data distribution. It has recently been shown (Hancock et al., 1992) that the first 3–4 eigenvectors extracted from Gaussian-smoothed natural images resemble "Gabor functions", which provide good models of cortical receptive fields.
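Both views of PCA discussed above can be sketched numerically (an illustrative example with synthetic data, not from the thesis): the eigendecomposition of the covariance matrix, whose top-k reconstruction error equals the sum of the discarded eigenvalues, and Oja's Hebbian rule Δw = η y (x − y w) with y = wᵀx, which converges to the first eigenvector.

```python
# PCA via eigendecomposition of the covariance matrix, plus Oja's
# single-neuron Hebbian rule (synthetic data; an illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)

# Data dominated by a k-dimensional subspace, embedded in d dimensions.
n, d, k = 500, 5, 2
latent = rng.normal(size=(n, k)) * np.array([2.0, 0.7])
mix = rng.normal(size=(k, d)) / np.sqrt(d)
X = latent @ mix + 0.05 * rng.normal(size=(n, d))
X = X - X.mean(axis=0)                      # center the data

# Eigendecomposition of the covariance matrix, sorted by eigenvalue.
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top-k eigenvectors (largest eigenvalues) and reconstruct.
V = eigvecs[:, :k]                          # d x k orthonormal basis
X_hat = (X @ V) @ V.T
mse_k = float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
print("reconstruction MSE:", mse_k)
print("sum of discarded eigenvalues:", float(eigvals[k:].sum()))

# Oja's rule: w <- w + eta * y * (x - y * w), with y = w . x.
w = rng.normal(size=d)
w /= np.linalg.norm(w)
eta = 0.01
for _ in range(3):                          # a few passes over the data
    for x in X[rng.permutation(n)]:
        yv = w @ x
        w += eta * yv * (x - yv * w)
alignment = float(abs(w @ eigvecs[:, 0]) / np.linalg.norm(w))
print("|cos| between Oja weight and first eigenvector:", alignment)
```

The printed MSE matches the sum of the discarded eigenvalues, which is exactly the optimality property that makes PCA the best linear compression in the mean-squared sense.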
However, the subsequent eigenvectors no longer look like cortical receptive fields. PCA extracts a distributed representation: although only the few components that carry most of the variance are kept, all components of the observation vector participate in the projection onto the eigenspace.

Below, we present the autoencoder network, which is closely related to PCA, and discuss its interpretation in the MDL framework.

3.1.2 Autoencoder network and MDL

An autoencoder network (Figure 3.1) is a feed-forward multi-layer perceptron (MLP) network with the output layer coinciding with the input layer. Usually, it contains a single hidden layer, though variants with additional hidden layers have also been considered (Kramer, 1991). The number of hidden units is assumed to be much smaller than the dimensionality of the input. Therefore, the network reduces the dimensionality of the input, extracting the
so-called internal representation in the hidden layer.

[Figure 3.1: Autoencoder network architecture. W – hidden-to-top weights; w – hidden weights. Reconstruction of the inputs is done from the hidden-layer representation.]

The autoencoder network has a natural interpretation in the MDL framework (Hinton and Zemel, 1994). It discovers an efficient way to communicate data to a receiver. A sender uses a set of input-to-hidden weights and, in general, non-linear activation functions to convert the input into a compact hidden representation. This representation has to be communicated to the receiver along with the reconstruction errors and the hidden-to-top weights. Knowing the hidden-to-top weights, the receiver reconstructs the input from this abstract representation and the communicated errors.

From Eq. 2.7.24, the description length is composed of the error-cost and the model-cost. Assuming that the errors are encoded using a zero-mean Gaussian with the same predetermined variance for each output unit, the error-cost is given by the sum of the squared errors. Since in the autoencoder the hidden units are always active, the model-cost may be approximated by the size of the hidden layer. Often the model-cost is ignored, and the MDL principle leads to a simple minimization of the sum of squared errors via a network with a bottleneck structure. Thus, the autoencoder learns a compact representation of the input. In addition, the bottleneck structure forces the network to learn prominent features of the input distribution that are useful for generalization. The network is robust to noise and may be used for pattern completion when part of the input is corrupted or absent.

A linear one-hidden-layer autoencoder is closely related to PCA, since its hidden weights span the same subspace as that found by the principal eigenvectors (Bourlard and Kamp, 1988).
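This relationship can be sketched with a small experiment (illustrative; the data and training details are our assumptions, not the thesis's): a linear one-hidden-layer autoencoder is trained by gradient descent on the reconstruction MSE, and the span of its learned encoder weights is compared with the principal subspace via principal angles.

```python
# A linear one-hidden-layer autoencoder trained by gradient descent;
# at the optimum its encoder weights span the principal subspace
# (Bourlard and Kamp, 1988). Synthetic data; an illustrative sketch.
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 400, 6, 2

# Centered data dominated by a k-dimensional subspace.
latent = rng.normal(size=(n, k)) * np.array([2.0, 1.0])
X = latent @ rng.normal(size=(k, d)) / np.sqrt(d)
X = X + 0.05 * rng.normal(size=(n, d))
X = X - X.mean(axis=0)

# Encoder w (d x k) and decoder W (k x d); linear activations, so the
# network computes X_hat = X w W.
w = 0.1 * rng.normal(size=(d, k))
W = 0.1 * rng.normal(size=(k, d))
eta = 0.05
for _ in range(3000):
    H = X @ w                   # hidden representation
    R = H @ W - X               # reconstruction residual
    grad_W = H.T @ R / n        # d(MSE/2)/dW
    grad_w = X.T @ (R @ W.T) / n
    W -= eta * grad_W
    w -= eta * grad_w

# Compare span(w) with the top-k eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # principal subspace
Q, _ = np.linalg.qr(w)                          # orthonormal basis of span(w)
# Singular values of V^T Q are the cosines of the principal angles;
# all close to 1 when the two subspaces coincide.
cosines = np.linalg.svd(V.T @ Q, compute_uv=False)
print("principal-angle cosines:", cosines)
```

Note that the learned columns of w are generally not orthogonal and differ from Wᵀ; only the subspace they span matches PCA.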
However, contrary to PCA, the hidden weights are not forced to be orthogonal and do not coincide with the hidden-to-top weights. The analytical solution of the
