
Analysis and applications of artificial neural networks



Artificial neural network

Published in: Engineering


  1. 1. Analysis and Applications of Artificial Neural Networks. L. P. J. Veelenturf
  2. 2. To Rutger, Wendy, Irene and Gerrie. Analysis and Applications of Artificial Neural Networks. L. P. J. Veelenturf. Prentice Hall: London, New York, Toronto, Sydney, Tokyo, Singapore, Madrid, Mexico City, Munich
  3. 3. First published 1995 by Prentice Hall International (UK) Limited, Campus 400, Maylands Avenue, Hemel Hempstead, Hertfordshire, HP2 7EZ. A division of Simon & Schuster International Group. © Prentice Hall International (UK) Ltd, 1995. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission, in writing, from the publisher. For permission within the United States of America contact Prentice Hall Inc., Englewood Cliffs, NJ 07632. Typeset in 10 on 12 pt Times by P&R Typesetters Ltd, Salisbury, Wiltshire, UK. Printed and bound in Great Britain by Bookcraft, Midsomer Norton. Library of Congress Cataloging-in-Publication Data: Veelenturf, L. P. J. (Leo P. J.) Analysis and applications of artificial neural networks / by L. P. J. Veelenturf. p. cm. Includes index. ISBN 0-13-489832-X (cased). 1. Neural networks (Computer science) I. Title. QA76.87.V42 1995 006.3--dc20. British Library Cataloguing in Publication Data: A catalogue record for this book is available from the British Library. ISBN 0-13-489832-X (hbk). 1 2 3 4 5 99 98 97 96 95
Contents
Preface
Acknowledgements
1 Introduction
  1.1 Machines and brains
  1.2 The artificial neural network
2 The binary Perceptron
  2.1 Introduction
  2.2 The performance of a single-neuron binary Perceptron
  2.3 Equivalent linear threshold functions
  2.4 Learning a single-neuron binary Perceptron with the reinforcement rule
  2.5 The Perceptron convergence theorem
  2.6 Performance of a two-layer binary Perceptron
  2.7 The adaptive recruitment learning rule
  2.8 Generalizing with a two-layer binary Perceptron
  2.9 The recruitment and reinforcement learning rule
  2.10 Application of the adaptive recruitment learning rule to switching circuits
  2.11 Application of the adaptive recruitment learning rule to hyphenation
  2.12 Application of the recruitment and reinforcement learning rule to contradictory binary data sets
  2.13 Exercises
3 The continuous multi-layer Perceptron
  3.1 Introduction
  3.2 The gradient descent adaptation method
  3.3 Learning with a single-neuron continuous Perceptron
  3.4 The exact fitting of the data set with a single-neuron Perceptron
  3.5 The approximate fitting of the data set with a single-neuron Perceptron
  3.6 Generalizing with a single-neuron continuous Perceptron
  3.7 The classification of data with a single-neuron Perceptron
  3.8 Hyperplane boundary classification by one-zero labelling
  4. 4. Contents
  3.9 Hyperplane boundary classification by double threshold labelling
  3.10 Hyperplane boundary classification by single threshold labelling
  3.11 Application to the classification of normally distributed classes
  3.12 Learning rule for a two-layer continuous Perceptron
  3.13 Under-fitting and over-fitting of a data set with a two-layer continuous Perceptron
  3.14 The class of functions realizable with a two-layer Perceptron
  3.15 The three-layer continuous Perceptron
  3.16 Application of a two-layer continuous Perceptron to function identification
  3.17 Application of a two-layer Perceptron to the mushroom classification problem
  3.18 Application of a two-layer Perceptron to the detection of the frequency of a sine wave
  3.19 Application of a multi-layer Perceptron to machine condition monitoring
  3.20 The learning speed of a continuous multi-layer Perceptron
  3.21 Initialization of weights and scaling the input and output
  3.22 Exercises
4 The self-organizing neural network
  4.1 Introduction
  4.2 Anthropomorphic pattern recognition with a self-organizing neural network
  4.3 The equivalent self-organizing neural net algorithm
  4.4 Vector quantization with the self-organizing algorithm
  4.5 Weight vector ordering with the self-organizing algorithm
  4.6 Application of the self-organizing neural net algorithm to the travelling salesman problem
  4.7 Application of the self-organizing neural net algorithm to picture colour quantization
  4.8 Nearest-neighbour classification with a self-organizing neural net algorithm
  4.9 The Bayes classification with a self-organizing neural net algorithm
  4.10 Application of the self-organizing neural net algorithm to the classification of handwritten digits
  4.11 Topology preservation with a self-organizing algorithm
  4.12 Interpolation with the self-organizing algorithm
  4.13 Master-slave and multi-net decomposition of the self-organizing neural net algorithm
  4.14 Application of the self-organizing algorithm to function identification
  4.15 Application of the self-organizing algorithm to robot arm control
  4.16 Application of the self-organizing algorithm to EEG signal analysis
  4.17 Application of the self-organizing algorithm to speech recognition
  4.18 Selecting and scaling of training vectors
  4.19 Some practical measures of performance of the self-organizing neural net algorithm
  4.20 Application of the self-organizing algorithm to signature identification
  4.21 Exercises
Bibliography
Index
  5. 5. Preface
‘Any process that works can be understood; what cannot be understood is suspect.’ This is effectively what Marvin Minsky and Seymour Papert stated twenty years ago during the first wave of interest in artificial neural networks. This critical remark could be made again in the recent second wave of interest in the topic, because for many people neural networks still seem something of a black art. Perhaps this attitude has been engendered by the tremendous amount of recent publications on neural networks, where it seems that every author invents his or her own type of neural network, which is frequently only justified by some particular application. For the reader it must be a chaotic collection of seemingly unrelated questions without a consistent framework and a unifying perspective. We certainly do not claim that we can give such a unified framework at this stage of development of this new scientific field, but we can at least open the black boxes of the main types of neural networks that can be thoroughly understood and that have turned out to be very useful in a broad range of applications. A reason for the excitement about neural networks might be that in the literature on artificial neural networks one frequently encounters very promising and attractive statements about the generalization capability of neural networks, like: ‘Neural networks are capable of adapting themselves with the aid of a learning rule and a set of examples to model relationships among the data without any a priori assumptions about the nature of the relationships.’ A similar statement is: ‘After learning, neural networks may be used to predict characteristics of new samples or to derive empirical models from examples in situations in which no theoretically based model is known.
’ Although to a certain extent these types of statements are true, one must be careful with the substatements that no model or a priori information about the nature of the relationship between examples is assumed. Generalization is the process of inductive inference of general relationships from a finite number of samples. An example is the inference of a new number in a finite sequence of numbers: one is inclined to say that the next example in the sequence 1, 4, 9, 16, ... will be the number 25, because we observe a simple regularity in the sequence: the $k$th element in the sequence is $k^2$. However, if no prejudice in favour of some type of ‘model’ exists, any number may follow the given sequence. For example, one might as well say that the next number is 27 because one is in favour of the regularity where the $n$th number $y(n)$ in the sequence is defined by $y(n) =$ ‘sum
  6. 6. of first $n$ uneven primes’. If one asks many people (who know elementary calculus!) to guess the next number in the sequence given above, almost all will say that the next number is 25. This phenomenon reveals the human attitude to select always the most ‘simple’ model to explain a sequence of experimental observations. A neural network is capable of modelling relationships among data by learning from the examples but is always using some a priori set of models. ‘Modelling’ can be defined as the process of formulating a finite set of interrelated rules (or the construction of a finite set of interconnected mechanisms) by which one can generate or explain the (potentially infinite) set of observed data. The simplicity of the model is very subjective because it depends on the domain of knowledge of the person (or of the mechanical process) that is doing the modelling. If one is using complex neural networks to ‘model’ the relationships behind a given set of data, it is hard to demonstrate that one is using (or assuming) a priori information about the kind of relationship. We will, however, demonstrate in several sections of this book that one needs, or assumes, a priori information about the relationship between the given data in order to be justified in accepting the outcome of the learning process of a neural network. The a priori information one is using is determined by the type of neural network, the configuration of the neural network and the type of neural transfer functions. In the application of neural networks to the solution of real-life problems it is important to be aware of this phenomenon. For example, if we want to infer with a two-layer continuous Perceptron from a given data set an unknown functional relationship and select the number of first-layer neurons in a Perceptron too low, then the unknown function will be underfitted (i.e.
will not go through all data points); if we select the number of neurons too high, then the unknown function will be overfitted (the realized function will go through the data points but will fluctuate wildly in between). This example, more extensively discussed in Section 3.13, shows that generalization by learning from examples and counterexamples is in general impossible without utilizing a priori knowledge about the properties of the function to be identified. If one uses a neural network to find the relation behind the data in a data set, one is in addition assuming that the data set is representative of the (unknown!) relation. Nevertheless, it must be said that by using neural networks one can solve in an optimal way certain problems that are hard to tackle by conventional methods. For instance, we will show that among the set of all classifiers that divide the $n$-dimensional input space by an $(n-1)$-dimensional hyperplane, the single-neuron Perceptron is an optimal classifier. We will concentrate in this book on these kinds of limitations and capabilities of the main types of neural networks, rather than giving a review of the multitude of different neural networks. Our aim is that the reader should profoundly understand and be able to apply artificial neural networks to the solution of practical problems. Almost 70 per cent of all publications deal with the type of networks we will analyze in this book: the binary Perceptron, the continuous Perceptron and the self-organizing neural network, and when we consider the applications that turn out to be useful, this percentage is even higher. The book is self-contained, which means that we will deliberately avoid the fashionable vicious circle of proving propositions by quoting the propositions of other authors.
The methods of analyzing neural networks are largely of a mathematical nature but the level of mathematical rigour in our exposition will nevertheless be low because we are far more concerned with providing insight and understanding than establishing a rigorous mathematical foundation. The required mathematical maturity is that of a typical final-year undergraduate student in electrical engineering or computer science. Understanding is the ability to transform new phenomena to a coherent simple structure of already well understood phenomena. For this reason we will give many illustrative examples and plausible arguments in terms of what is supposed to be well known by the reader. A great number of real-life applications will also contribute to this understanding and will show at the same time the powerful usefulness of neural networks. For several years we have taught courses at the University of Twente based on the material in this book. The book can be covered in a one-semester course. Each chapter concludes with some exercises. The lists of literature are far from complete because we only want to give the reader a map for the main routes in the bewildering and chaotic landscape of published material. The exercises are meant as a means to check one's own understanding of the presented theory. While writing this book we benefited from the comments of many colleagues and from the experiments performed by our students. We would like to thank especially Philip de Bruin, Mark Bentum, Andre Beltman and Cuun Krugers-Dagneaux-Riikkers for their suggestions for improving the manuscript. Leo P. J. Veelenturf
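The sequence example earlier in the Preface can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the book: the function names are hypothetical, and it assumes that 1 is counted as the first ‘uneven prime’, since only under that reading does the prime-sum rule reproduce 1, 4, 9, 16 before continuing with 27.

```python
def is_prime(m: int) -> bool:
    """Trial-division primality test; returns False for m < 2."""
    if m < 2:
        return False
    return all(m % d for d in range(2, int(m ** 0.5) + 1))


def uneven_primes(count: int) -> list[int]:
    """First `count` 'uneven primes'.

    Assumption: 1 is counted as the first of them (an older convention);
    this is what makes the rule agree with 1, 4, 9, 16.
    """
    found = [1]
    candidate = 3
    while len(found) < count:
        if is_prime(candidate):
            found.append(candidate)
        candidate += 2  # only odd candidates
    return found[:count]


def squares_model(n: int) -> int:
    """Model A: the n-th element is n squared."""
    return n * n


def prime_sum_model(n: int) -> int:
    """Model B: the n-th element is the sum of the first n uneven primes."""
    return sum(uneven_primes(n))


observed = [1, 4, 9, 16]
squares = [squares_model(n) for n in range(1, 6)]
prime_sums = [prime_sum_model(n) for n in range(1, 6)]

# Both models explain the four observations, yet disagree on the fifth.
print(squares)
print(prime_sums)
```

Both rules fit the observed data exactly, so the data alone cannot decide between them; the preference for 25 comes from the a priori preference for the ‘simpler’ model.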
  7. 7. Acknowledgements
While writing the first chapter of this book we benefited from the contribution of several of our students. We would like to thank Arend Lammertink for his investigations on hyphenation with a binary Perceptron and especially for his idea to decompose the neural network into many subnetworks in order to reduce the computation time; Erik Coelingh for the realization of a software simulation program for the adaptive recruitment learning rule as well as for the recruitment and reinforcement learning rule; and Herco Reinders for his experiments on hyphenation with the recruitment and reinforcement learning rule. Thanks are due to all the students who performed many experiments with the continuous multi-layer Perceptron. I want to mention Ronald van der Zee, Sander Lokerse, Berend Jan van der Zwaag, Peter Degen, Robert de Jong, Jan Wijnholt, Rolf van der Wal, Hans van't Spijker and Henk Martijn for their experiments on classification and function identification. Ronald Edens and Remco Couwenberg performed experiments on speech recognition with time delay and recurrent Perceptrons. A. Rensink, Manjo Nijrolder and Stephan ten Hage did the experiments with the harmonic signal detector. Peter Spekreijse and Geert Pronk used the multi-layer Perceptron for machine condition monitoring. Guido Frederiks made a useful contribution to determining the configuration of a neural net. Marcel Budding demonstrated that a neural net can be used for principal component analysis. John Kroeze made a contribution by analyzing neural networks with a piecewise linear approximation. Theo Roelands investigated the relation between the size of the data set and the complexity of the neural network. Special thanks are due to Erik Tromp for his proof that a two-layer neural network can approximate any continuous function arbitrarily closely.
I want to express my gratitude to Ian Zandhuizen who made me aware of the intriguing properties of Kohonen's self-organizing neural network. In order to obtain a better understanding of the self-organizing neural network and especially to evaluate the learning process parameters that are theoretically difficult to establish, a lot of experiments were performed by our students. Investigations on anthropomorphic pattern recognition were done by Gerrit van Brakel, Ruud Coppoolse, Barthold Lichtenbelt, Walter Jansen, Frank Staijen, and last but not least by Ronald Muller. Experiments on classification with the self-organizing neural network were
  8. 8. performed by Ronald van der Zee, Wieant Nielander, Geert Pronk and Sander Lookerse. Paul Buisman made an analysis of the behaviour of the original neural network introduced by Kohonen. The vector quantization properties were investigated by Hans Neggers. Marco Bloemendaal, Ruurd de Vries, Ruud Busschers, Erik Groot, Patrick Wilmerink, Stefan van Hal and Klaas Scheppink applied the self-organizing algorithm to spike-wave detection in EEG signals. Arjan Draaijer investigated the application to adaptive system control. Experiments on speech recognition were performed by Ronald Edens, Paul de Haan, Andre Beltman, Guido de Jong, Herman Woudstra and Stefan ten Hage. Sake Buwalda developed a VHDL implementation of the self-organizing algorithm. Finally I owe a debt of gratitude to Andre Beltman who was involved in almost all experiments mentioned above and who developed very practical measures for the evaluation of ordering and topology preservation with a self-organizing neural algorithm.
1 INTRODUCTION
1.1 Machines and brains
For several years the author has been involved in research on pattern recognition. In that period he became aware of the tremendous range of sophisticated methods used to analyze and to recognize pictures by machines. The pattern recognition machines were equipped with large numbers of vast and complicated algorithms. The most advanced machines could, for instance, recognize a certain class of handwritten digits, but in spite of the sophisticated nature of these machines they were limited to recognizing those pictures that had been foreseen by the system builders as potential elements to be recognized in the future. For example, one can build machines to recognize handwritten capital ‘A’s but the system will fail to recognize a capital ‘A’ as given in Figure 1.1. It is surprising that human beings can recognize the letter in Figure 1.1 as an ‘A’ as it is very unlikely that one has ever seen the figure before.
It is very unlikely that human beings compare the handwritten ‘A’ to some reference picture stored in their brain. Probably they know the characteristic features of an ‘A’ and their perception is not disturbed by artefacts in the picture. The interference of artefacts in a picture will, however, destroy the correct classification by a programmed machine, which is probably not able to judge the importance of deviations from the preprogrammed standard features. The way people have acquired the ability to recognize pictures can only be by experience. By trial and error they have learned to perform certain tasks. Machines do not learn, they are preprogrammed, and if they can learn they are restricted to certain classes of preprogrammed methods of learning. The lesson seems to be that the capability of learning is essential for more advanced and intelligent artificial machines. Another striking difference between machines and human beings is the ‘computation’ time required for complicated tasks such as pattern recognition. Computers are extremely fast but it is hard to design machines that can recognize three-dimensional objects in real time, whereas humans, whose brains are composed of neurons switching about a million times slower than electronic components, can recognize old friends almost instantaneously. We know that computers perform their computations sequentially, step by step, whereas the human brain is processing the information in parallel.
  9. 9. [Figure 1.1 A handwritten capital ‘A’.]
The second lesson for designing more intelligent systems seems to be the use of the parallel processing of information. Man-made machines are built with a large number of different complicated functional building blocks; if one unit fails, the whole system collapses. The brain, however, is built out of a large number of, at least from a functional point of view, almost identical building bricks: the neurons. Many units may be destroyed without significantly changing the behaviour of the total system. This comparison between the behaviour and construction of artificial machines and the behaviour and physiological configuration of the human brain might give new ideas for developing more intelligent machines. Artificial neural networks are the results of the first steps in this new direction for intelligent system design.
1.2 The artificial neural network
The building unit of a neural network is a simplified model of what is assumed to be the functional behaviour of an organic neuron. The human brain contains about $10^{11}$ neurons. For almost all organic neurons one can distinguish anatomically roughly three different parts: a set of incoming fibers (the dendrites), a cell body (the soma) and one outgoing fiber (the axon). For a simplified configuration see Figure 1.2. The axons divide up into different endings, each of which makes contact with other neurons. A neuron can receive up to 10000 inputs from other neurons. The bulb-like structures where fibers contact are called synapses. Electrical pulses can be generated by neurons (so-called neuron firing) and are transmitted along the axon to the synapses. When the electrical activity is transferred by the synapse to another neuron, it may contribute to the excitation or inhibition of that neuron.
The synapses play an important role because their transmission efficiency for electrical pulses from an axon to the dendrites (or somas) of other neurons can be changed depending on the ‘profitability’ of that alteration. The learning ability of human beings is probably incorporated in the facility of changing the transmission efficiency of those synapses. Donald O. Hebb was among the first who postulated this mechanism in his book Organization of Behavior (1949): ‘When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.’ The change of the synaptic transmission efficiency acts as a memory for past experiences. In a simplified artificial model of a neuron (see Figure 1.3), the synaptic transmission efficiency is translated into a real number $w_i$ by which an input $x_i$ is multiplied before entering the neuron cell. The number $w_i$ is called the weight of input $x_i$. The absence or presence of a train of electrical pulses in a real neural fiber is modelled by a variable $x_i$ which respectively may have the value zero or one. In that case we will say that we have a binary artificial neuron representing the ‘one-or-zero’ behaviour of a real neuron.
[Figure 1.2 Simplified configuration of an organic neuron. Labels: dendrites, synapse, axon, soma.]
[Figure 1.3 Artificial model of a neuron.]
  10. 10. In reality, neurons may fire up to about 100 pulses in a second. We can model this gradual change in pulse train frequency by a variable $x_i$ which may have any value between zero and one. In that case we have a continuous artificial neuron. In almost all artificial models of neurons, all inputs $x_i$ are weighted by the synaptic transmission efficiency and are summed to one number $s = \sum_i x_i w_i$. This weighted input $s$ determines in some way or another the output value $y$ of the artificial neuron. In a binary neuron the output $y$ will be one (the neuron is firing) if the weighted input exceeds some threshold $T$ and will be zero (the neuron is silent) if the weighted input is below the threshold. In a continuous artificial neuron, the output $y$ may be some monotone increasing function (the frequency of the pulse train on the outgoing axon is gradually changing depending on $s$) of the weighted input $s$. The model of an artificial neuron outlined above was first introduced by the neurophysiologist Warren McCulloch and the logician Walter Pitts in 1943. In a famous paper by the mathematician S. C. Kleene in 1951, it was shown that with the artificial neurons introduced by McCulloch and Pitts, one can build a system that behaves in the same way as a computer. Although it is important to know that artificial neural nets are not inferior in their computation capabilities to computers, it is of no practical use to mimic computers. The benefits of an artificial neural network are mainly the results of the modifiability of behaviour by changing the weights $w_i$ in a learning process. The learning behaviour of artificial neural nets was first treated extensively in a book by Frank Rosenblatt in 1962, Principles of Neurodynamics. He introduced a learning algorithm by which the weights can be changed such that a desired computation was performed. The wave of activity on artificial neural networks in the mid-1960s was, however,
challenged in 1969 by Marvin Minsky and Seymour Papert, who showed in their book Perceptrons that some simple computations cannot be done with a one-layer neural net, and doubted that a learning algorithm could be found for multi-layer neural networks. At that time many scientists left the field of artificial neural networks. A second upheaval took place in the mid-1980s when several people found a learning algorithm, called the back-propagation algorithm, that could adjust the weights in multi-layer neural nets. About that time also new types of neural net with dynamic behaviour were introduced. We mention the neural net of Hopfield (1982) (not treated in this book) and the self-organizing neural net of Kohonen (1982). Dynamic neural nets are characterized by feedback: the output of a neuron depends, after some delay, on its own output because of the fully interconnected structure of the neural nets. The binary Perceptron we will discuss in Chapter 2 and the continuous Perceptron of Chapter 3 are not dynamic systems: after learning, the output $y(t)$ of the neural net will only depend on the actual input $x(t)$ and not on previous inputs. However, the input-output behaviour $F$: $y(t) = F(x(t))$, learned by the neural net in the training phase, will depend on the sample input-output behaviour of the training set. For a dynamic system the actual output $y(t)$ depends not only on the actual input $x(t)$ but also on the actual state $q(t)$: $y(t) = F(q(t), x(t))$, whereas the state depends on inputs in the past. There exist neural networks that can learn dynamic behaviour from sequences of inputs and corresponding sequences of outputs (Veelenturf, 1981). We will not discuss these networks because they are a typical example of solving problems by using a neural network, whereas, except for special cases, there are more efficient methods of finding the solutions with conventional methods (Veelenturf, 1978).
This observation leads to the warning that one must be aware that using neural networks is not a panacea; it is frequently better to have recourse to more conventional methods.
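The neuron model of Section 1.2 (a weighted input $s = \sum_i x_i w_i$ followed by either a threshold or a monotone increasing function) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the book's implementation: the function names are hypothetical, and the logistic function is only one possible choice for the ‘monotone increasing function’, which the text leaves unspecified.

```python
import math
from typing import Sequence


def weighted_input(x: Sequence[float], w: Sequence[float]) -> float:
    """Weighted input s = sum_i x_i * w_i."""
    return sum(xi * wi for xi, wi in zip(x, w))


def binary_neuron(x: Sequence[float], w: Sequence[float], threshold: float) -> int:
    """Binary neuron: fires (1) if s exceeds the threshold T, else silent (0)."""
    return 1 if weighted_input(x, w) > threshold else 0


def continuous_neuron(x: Sequence[float], w: Sequence[float], threshold: float) -> float:
    """Continuous neuron: output is a monotone increasing function of s.

    Assumption: a logistic function centred on the threshold is used here
    purely as an example of such a function.
    """
    s = weighted_input(x, w)
    return 1.0 / (1.0 + math.exp(-(s - threshold)))


weights = [0.5, -0.3, 0.8]  # hypothetical weights w_i
T = 1.0                     # hypothetical threshold

fires = binary_neuron([1, 0, 1], weights, T)    # s = 1.3 > T
silent = binary_neuron([0, 1, 0], weights, T)   # s = -0.3 <= T
high = continuous_neuron([1, 0, 1], weights, T)
low = continuous_neuron([0, 1, 0], weights, T)
print(fires, silent, high > low)
```

Learning, in this picture, amounts to changing the entries of `weights`; the transfer function itself stays fixed.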
  11. 11. THE BINARY PERCEPTRON
2.1 Introduction
The behaviour of an artificial neuron is inspired by the assumed behaviour of a real neuron in organic neural networks. A simplified model of a real neuron is composed of a cell body or soma, a set of fibers entering the cell body, called the dendrites, and one special fiber leaving the soma, called the axon. The dendrites transmit trains of electrical pulses towards the soma and the axon conducts trains of pulses away from the soma. The axon terminates by branching into many filaments. The filaments end in bulb-like structures called synapses that make contact with dendrites or somas of other neurons. The transfer of electrical pulses from the final filaments of some axon to the dendrites or soma of another neuron depends on the synaptic transmission efficiency, represented by the variable $w$. If the synaptic transmission efficiency is positive, the synapse is said to be excitatory; if negative, the synapse is called inhibitory. The positive or negative transmission efficiency may vary between small and large values. Only when the sum of ‘synaptic weighted’ incoming pulses is greater than some threshold is a train of pulses generated by the soma and transmitted by the axon (see Figure 2.1). In organic neural tissue the pulse frequency may vary between a few pulses per second up to twenty pulses per second. An additional simplification is to disregard the frequency of the pulse trains and to consider only the presence, represented by the number 1, or absence, represented by the number 0, of a pulse train. This simplified model of a real neuron is called the one-or-zero behaviour of a neuron. This simplified model of a neuron can easily be simulated by an artificial neuron (see Figure 2.2). Dendrites are represented by input lines and a variable $x_i$ represents the presence [$x_i(t) = 1$] or absence [$x_i(t) = 0$] of a pulse train on fiber $i$ at time $t$.
Every artificial neuron has one output line representing the axon of the neuron and the presence or absence of a pulse train at an axon is represented by the value 1 or 0 of the variable $y(t)$. There will be one special input line with a constant input $x_0 = 1$ and a weight $w_0$. This constant input, $x_0 = 1$, and the weight $w_0$ realize a threshold equal to $-w_0$. When the ‘weighted’ sum of incoming signals is greater than the threshold $T = -w_0$, the output $y$ will become 1 after a delay $\tau$. The input-output behaviour of an artificial neuron is now specified by:
$$y(t+\tau) = \begin{cases} 1 & \text{if } \sum_{i \geq 1} w_i x_i(t) > -w_0 \\ 0 & \text{otherwise} \end{cases}$$
[Figure 2.1 Label: synapses.]
[Figure 2.2 Artificial model of a neuron.]
The variable $w_i$ is called the weight of input line $i$ and represents the synaptic transmission efficiency of the synapse between the final filament of a neuron and the dendrite (or the soma) of a particular neuron. The threshold $T = -w_0$, the weights $w_i$ and the delay $\tau$ are real valued. If there is no feedback in the neural network we may take $\tau = 0$, and the time dependency of $x_i$ and $y$ can be ignored. So the previous
  12. 12. formulation of the input-output behaviour can be replaced by:
$$y = \begin{cases} 1 & \text{if } \sum_{i \geq 1} w_i x_i > -w_0 \\ 0 & \text{otherwise} \end{cases}$$
An artificial neuron can easily be implemented in a simple electronic circuit (see Figure 2.3). Those acquainted with electronics will understand that the transistor will be open if:
$$\frac{E_1}{R_1} + \frac{E_2}{R_2} > -\frac{U_s}{R_3}$$
If the voltage $E_1$ represents $x_1$, $E_2$ represents $x_2$, and $-U_s/R_3$ represents the threshold $-w_0$, then we obtain with $1/R_1 = w_1$ and $1/R_2 = w_2$ that the transistor is open if:
$$w_1 x_1 + w_2 x_2 > -w_0$$
[Figure 2.3 Electronic implementation of an artificial neuron.]
Networks composed of layers of interconnected artificial neurons have been studied extensively by many authors. The analysis of neural networks is attractive because all the building units, the neurons, are the same and the transfer function of such a unit is quite simple. More important, however, is that we can alter the behaviour of a neuron by changing in a learning process the weights $w_i$ in the input lines. Changing weights is the artificial counterpart of the adaptation of the synaptic efficiency in real organic neural networks. Before examining this learning behaviour of a neural network, we consider the ‘zero-or-one behaviour’ of just one single artificial ‘binary’ neuron. With a single neuron, for example, we can realize some restricted class of predicate logic. Consider the statement: ‘John is going out for a walk if and only if the sun is shining or if it is cold and the wind is blowing west.’ The predicate ‘John is going out for a walk’ is only TRUE if the conditions mentioned are TRUE. Now we can represent the ‘truth value’ TRUE by the number 1 and FALSE by the number 0.
Table 2.1
  x1  x2  x3  y
  0   0   0   0
  0   0   1   0
  0   1   0   0
  0   1   1   1
  1   0   0   1
  1   0   1   1
  1   1   0   1
  1   1   1   1
[Figure 2.4 Neuron illustrating Table 2.1. Recoverable labels: $x_0 = 1$; weights $-1$, $+2$, $+1$, $+1$.]
We represent the truth value of the predicate 'The sun is shining' by $x_1$, the truth value of 'It is cold' by $x_2$, the truth value of 'The wind is blowing west' by $x_3$ and the truth value of 'John is going out for a walk' by $y$. With this notation we can enumerate all possible situations in a simple truth table, as shown in Table 2.1. If we now consider the values of $x_1$, $x_2$ and $x_3$ as the inputs of a single neuron and $y$ as the output, we can select the weights $w_0$, $w_1$, $w_2$ and $w_3$ in such a way that the output behaviour of that neuron yields the truth value of the predicate 'John is going out for a walk' (Figure 2.4). Methods for finding the appropriate weights analytically, or by a learning process, will be discussed later. Pioneers in this field of research, like Rosenblatt (1962) and Minsky and Papert (1969), investigated neural networks with the aim of using such networks mainly for
  13. 13. pattern recognition problems. For this reason they called those networks Perceptrons (from perception). In honour of Rosenblatt, who used the term first, we will call the networks discussed in Chapters 2 and 3 Perceptrons.

As an example we consider a pattern recognition problem. Some pattern is projected onto a grid of small squares. A variable $x_i$ is assigned to each square. The variable $x_i$ will have the value 1 if the pattern is covering that particular square, and 0 if the pattern is not covering that square (see Figure 2.5). The values of the variables $x_i$ constitute the inputs of a binary neural network. The output of the neural network will classify patterns as belonging to some predefined class ($y = 1$) or not ($y = 0$).

For example, if we have a very small grid of four squares and one single neuron (see Figure 2.6), one may wish to classify the sixteen different artificial patterns (see Table 2.2) as whether patterns of at least two black squares are connected (i.e. the black squares are adjacent), $y = 1$, or not, $y = 0$.

Figure 2.6 Classification of sixteen patterns by a neural network

Table 2.2
x1  x2  x3  x4  y      x1  x2  x3  x4  y
0   0   0   0   0      1   0   0   0   0
0   0   0   1   0      1   0   0   1   0
0   0   1   0   0      1   0   1   0   1
0   0   1   1   1      1   0   1   1   1
0   1   0   0   0      1   1   0   0   1
0   1   0   1   1      1   1   0   1   1
0   1   1   0   0      1   1   1   0   1
0   1   1   1   1      1   1   1   1   1

Although there are many pattern classification problems that can be solved with a single neuron, we will demonstrate in the next chapter that there exists no set of weights $w_0$, $w_1$, $w_2$, $w_3$ and $w_4$ such that Table 2.2 is realized by a single neuron classifier. It turns out that we need a two-layer neural network with at least two neurons in the first layer.
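The claim about Table 2.2 can be made plausible with a small exhaustive search, sketched below. The grid layout (squares $x_1$, $x_2$ on the top row and $x_3$, $x_4$ on the bottom row) is an assumption inferred from the table, and the names `connected`, `neuron` and `search` are illustrative. A search over a bounded range of integer weights is only evidence, not a proof.

```python
from itertools import product

# Assumed layout of the four squares:   x1 x2
#                                       x3 x4
# Index pairs (0-based) of squares that share an edge.
ADJACENT = [(0, 1), (0, 2), (1, 3), (2, 3)]


def connected(pattern):
    """Target of Table 2.2: 1 if some pair of adjacent squares is black."""
    return int(any(pattern[a] and pattern[b] for a, b in ADJACENT))


def neuron(weights, pattern):
    """Single binary neuron; weights[0] is the bias weight w0."""
    w0, *w = weights
    return int(w0 + sum(wi * xi for wi, xi in zip(w, pattern)) > 0)


patterns = list(product((0, 1), repeat=4))
targets = [connected(p) for p in patterns]


def search(limit=3):
    """Try every integer weight vector with entries in [-limit, limit]."""
    values = range(-limit, limit + 1)
    for weights in product(values, repeat=5):
        if all(neuron(weights, p) == t for p, t in zip(patterns, targets)):
            return weights
    return None


print(search())  # prints None: no weight vector in the searched range fits
```

The reason no candidate can succeed is visible in four rows of the table: the patterns 1100 and 0011 require $w_0 + w_1 + w_2 > 0$ and $w_0 + w_3 + w_4 > 0$, while 1001 and 0110 require $w_0 + w_1 + w_4 \le 0$ and $w_0 + w_2 + w_3 \le 0$, and the two pairs have the same sum.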
We can see directly that the problem can also be solved with four neurons in the first layer (one neuron for detecting that $x_1$ and $x_2$ are both 1, one neuron for detecting that $x_1$ and $x_3$ are both 1, one neuron for detecting that $x_2$ and $x_4$ are both 1, and one neuron to detect that $x_3$ and $x_4$ are both 1), and one neuron in a second layer to detect that at least one neuron in the first layer has an output of 1.

One might suspect that a single neuron is not able to solve a classification problem if there are a great number of input variables. There are, however, problems with only two input variables that are also not solvable with a single neuron. Consider, for example, the Boolean 'exclusive-or' function: $y = x_1 \oplus x_2$, i.e. the output of the single neuron must be 1 if and only if $x_1 = 1$ or $x_2 = 1$ but not both. We will see in the next chapter that we cannot solve this problem with a single neuron but we will demonstrate, on the other hand, that any Boolean function can be realized with a two-layer Perceptron.

This example indicates at the same time a third application area for the use of binary Perceptrons: the realization of Boolean functions. Because we can realize any Boolean function with a binary Perceptron and because every neuron can be implemented with an electronic circuit, we have the fourth application area: switching circuits.

In a subsequent section of this chapter we will study two-layer binary neural networks composed of interconnected artificial neurons without feedback connections between neurons. Different Boolean functions can be realized in parallel, e.g. with the two-layer neural network given in Figure 2.7. The neural net of Figure 2.7 classifies simple patterns consisting of three pixel points in three classes as specified by Table 2.3. If a pattern $p = \langle p_1 p_2 p_3 \rangle$ is a member of the class $K_1 = \{(000), (001), (100), (111)\}$,
  14. 14. then $y_1 = 1$ and $y_2 = 0$. If the pattern is a member of the class $K_2 = \{(010), (011)\}$, then $y_1 = 0$ and $y_2 = 1$. If the pattern is not a member of $K_1$ or $K_2$, then $y_1 = 0$ and $y_2 = 0$.

Figure 2.7 Neural net for classification with two-dimensional output

2.2 The performance of a single-neuron binary Perceptron

In the previous section we saw that a single neuron performs a kind of 'weighted voting' on variables $x_i$: the output $y$ of the neuron will be 1 if and only if $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$ is greater than some threshold $T$.

Example 2.1
Consider the balance of Figure 2.8. At equally spaced points there might be objects with some weight $g_k$ at the balance pole. At the left-hand side there is one fixed weight $g_0$ attached to the balance at a unit distance from the suspension point. We use the variable $x_k$ to indicate whether ($x_k = 1$) or not ($x_k = 0$) there is an object placed at distance $k$ from the point of suspension. Now 'The balance will tip to the right' if and only if:

$$\sum_k k\, g_k x_k > g_0$$

or, with $k g_k$ replaced by $w_k$, $g_0 = T$, and when the predicate 'The balance will tip to the right' is replaced by the binary variable $y$, we obtain:

$$y = 1 \text{ if and only if } \sum_k w_k x_k > T$$

Figure 2.8 The mechanical equivalent of a threshold function

We will now investigate the properties of threshold functions like the one used above. We define first: a function $y = f(x_1, x_2, \ldots, x_n)$ is called a binary linear threshold function with respect to the binary-valued variables $x_1, x_2, \ldots, x_n$ if there exist a number $T$ and a set of numbers $\{w_1, w_2, \ldots, w_n\}$ such that $y = 1$ if and only if $\sum w_i x_i > T$. We usually will drop the adjective 'binary' and, although important, we will frequently drop the phrase 'with respect to the binary-valued variables $x_1, x_2, \ldots, x_n$'. With the use of the step function $S(z)$ defined by:

$$S(z) = 1 \text{ if } z > 0 \quad \text{and} \quad S(z) = 0 \text{ if } z \le 0$$

we can equivalently say that $y = f(x_1, x_2, \ldots, x_n)$ is a linear threshold function if $y = S(\sum w_i x_i - T)$. If the weights $w_i$ constitute the components of a so-called weight vector $w$ and the variables $x_i$ are the components of the input vector $x$, we can also write a linear threshold function as $y = S(w^t x - T)$, with $w^t$ the transpose of the vector $w$. Note that a linear threshold function is not a linear function in the ordinary sense, because for a linear threshold function we have in general $f(\alpha x) \ne \alpha f(x)$.

In the introduction to Section 2.1 we have demonstrated that a linear threshold function can be realized by a single-neuron binary Perceptron with the threshold $T = -w_0$ realized by a constant input $x_0 = 1$ and a corresponding weight $w_0$. If we introduce a so-called extended weight vector $\hat{w}$ with $\hat{w}^t = [w_0, w_1, w_2, \ldots, w_n]$, and a so-called extended input vector $\hat{x}$ with $\hat{x}^t = [1, x_1, x_2, \ldots, x_n]$, then we can compactly write a linear threshold function realized by a single-neuron Perceptron as $y = S(\hat{w}^t \hat{x})$.

Now we concentrate on the class of logical functions that can be realized by a single-neuron binary Perceptron. If $y = f(x)$ is a logical function of two variables $x_1$ and $x_2$ we can have sixteen different functions. (NB: there are $2^2$ different arguments and for each argument the function value can be either 1 or 0.) Of those sixteen functions, fourteen can be realized by a single-neuron binary Perceptron. The two functions that cannot be realized are specified by Table 2.3. The first function is called the exclusive-or function and the second the identity function. We will prove now that the first function is not a linear threshold function and
  15. 15. thus it cannot be realized by a single-neuron binary Perceptron.

Table 2.3
x1  x2  y      x1  x2  y
0   0   0      0   0   1
0   1   1      0   1   0
1   0   1      1   0   0
1   1   0      1   1   1

Assuming, however, that we can realize the exclusive-or function, then there must be weights $w_0$, $w_1$ and $w_2$ such that:

For the first argument [0, 0] we must have: $w_0 + w_1 x_1 + w_2 x_2 \le 0$, thus: $w_0 \le 0$
For the second argument [0, 1] we must have: $w_0 + w_1 x_1 + w_2 x_2 > 0$, thus: $w_0 + w_2 > 0$
For the third argument [1, 0] we must have: $w_0 + w_1 x_1 + w_2 x_2 > 0$, thus: $w_0 + w_1 > 0$
For the fourth argument [1, 1] we must have: $w_0 + w_1 x_1 + w_2 x_2 \le 0$, thus: $w_0 + w_1 + w_2 \le 0$

One can easily verify that the four inequalities for the weights cannot be satisfied simultaneously: adding the second and third gives $2w_0 + w_1 + w_2 > 0$, whereas adding the first and fourth gives $2w_0 + w_1 + w_2 \le 0$. This completes our proof.

Our conclusion that the exclusive-or function cannot be realized by a single-neuron Perceptron also becomes clear when we consider the realization of the exclusive-or function as a classification problem. We have two classes of points in a two-dimensional input space (see Figure 2.9). For one class of points $\{(0, 1), (1, 0)\}$ the output of the neuron must be 1, and for the other class $\{(0, 0), (1, 1)\}$ the output must be 0. Thus for the first class we must have $w_0 + w_1 x_1 + w_2 x_2 > 0$, and for the second class we must have $w_0 + w_1 x_1 + w_2 x_2 \le 0$. The set of points for which $w_0 + w_1 x_1 + w_2 x_2 = 0$ represents a separating line in the two-dimensional input space. On one side of this line we will have for every point $(x_1, x_2)$ that $w_0 + w_1 x_1 + w_2 x_2 > 0$ and thus the output of the neuron will be equal to 1. On the other side of the line we will have $w_0 + w_1 x_1 + w_2 x_2 < 0$ and the output of the neuron will be 0. Now one can easily check that we cannot locate a line between the two sets of points that must be separated; thus there exists no solution for our classification problem.

Figure 2.9 Exclusive-or as a classification problem

Fortunately we can show in a later section that the problem can be solved with a two-layer network.

We will now discuss the more general situation of the realization of logical (or binary) functions with $n$ arguments with a single-neuron Perceptron. There are $2^{2^n}$ different functions with $n$ arguments. A large number of these functions are not linear threshold functions with respect to the variables $x_1, x_2, \ldots, x_n$ and thus cannot be written as $y = S(\sum w_i x_i - T)$ and thus cannot be realized with a single-neuron binary Perceptron. We will give an example of how to find the linear threshold function (if it exists) given the truth table of a logical function with three arguments.

Example 2.2
Let a function be specified by the truth table given in Table 2.4 (see also Figure 2.4).

Table 2.4
     x1  x2  x3  y
1    0   0   0   0
2    0   0   1   0
3    0   1   0   0
4    0   1   1   1
5    1   0   0   1
6    1   0   1   1
7    1   1   0   1
8    1   1   1   1

We want to obtain a function of the form $y = S(\sum w_i x_i - T)$, or, with the threshold $T$ replaced by the weight $-w_0$, we want to find the expression:

$$y = S(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3)$$
  16. 16. For the successive arguments we must have the following:

1. $w_0 \le 0$
2. $w_0 + w_3 \le 0$
3. $w_0 + w_2 \le 0$
4. $w_0 + w_2 + w_3 > 0$
5. $w_0 + w_1 > 0$
6. $w_0 + w_1 + w_3 > 0$
7. $w_0 + w_1 + w_2 > 0$
8. $w_0 + w_1 + w_2 + w_3 > 0$

From (1) and (4) we conclude $w_2 + w_3 > 0$. Take $w_2 = 1$ and $w_3 = 1$; then from (2) and (3) we conclude $w_0 \le -1$. Take $w_0 = -1$. From (5) we infer $w_1 > 1$; take $w_1 = 2$. With this selection of weights all inequalities are satisfied and we obtain the following linear threshold function:

$$y = S(-1 + 2x_1 + x_2 + x_3)$$

From this example it becomes clear that there exists, within the bounds given by the set of inequalities, a certain amount of freedom to select the weights. For instance we could select as well: $w_0 = -2$, $w_1 = 4$, $w_2 = 2$ and $w_3 = 2$; another selection could be: $w_0 = -1$, $w_1 = 4$, $w_2 = 1$ and $w_3 = 1$. We will return to this topic in Section 2.3.

As stated before, not all logical functions are linear threshold functions with respect to the variables $x_1, x_2, \ldots, x_n$. It is not easy to determine whether a given function is a linear threshold function or not. At present there is only one more or less practical method by which this can be done, and that is by determining whether or not the set of inequalities associated with the logical function contains a contradiction.

Example 2.3
Consider the logical function $y = f(x_1, x_2, x_3)$ defined by: $y = 1$ if and only if the number of 1s in the argument is odd. This function is known as the parity problem and is specified in Table 2.5.

Table 2.5
     x1  x2  x3  y
1    0   0   0   0
2    0   0   1   1
3    0   1   0   1
4    0   1   1   0
5    1   0   0   1
6    1   0   1   0
7    1   1   0   0
8    1   1   1   1

We want to find an expression of the form: $y = S(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3)$. For the successive arguments we must have the following:

1. $w_0 \le 0$
2. $w_0 + w_3 > 0$
3. $w_0 + w_2 > 0$
4. $w_0 + w_2 + w_3 \le 0$
5. $w_0 + w_1 > 0$
6. $w_0 + w_1 + w_3 \le 0$
7. $w_0 + w_1 + w_2 \le 0$
8. $w_0 + w_1 + w_2 + w_3 > 0$

In this case we can see immediately that the parity problem cannot be solved with a linear threshold function,
because from (1) and (2) we conclude that $w_3 > 0$ and from (3) and (4) we conclude that $w_3 < 0$; hence we have a contradiction.

In the general case of $n$ variables we have to investigate $2^n$ inequalities. We can save ourselves a great deal of effort if we eliminate redundant equations and simplify expressions by using the following set of properties.

Properties of inequalities
1. $a > 0$ and $b > 0 \Rightarrow a + b > 0$
2. $a \le 0$ and $b \le 0 \Rightarrow a + b \le 0$
3. $a > 0$ and $a + b \le 0 \Rightarrow b < 0$
4. $a \le 0$ and $a + b > 0 \Rightarrow b > 0$
5. $a + b > 0$ and $a + c \le 0 \Rightarrow b > c$
6. $\sum a_i > 0$ and $\lambda > 0 \Rightarrow \sum \lambda a_i > 0$
7. $\sum a_i \le 0$ and $\lambda > 0 \Rightarrow \sum \lambda a_i \le 0$

From rules (1) and (2) we can derive a property that we can sometimes use to check whether a logical function can be a linear threshold function or not without writing down the set of all inequalities. Let $a$ in rule (1) represent the sum of some set of weights associated with some input vector $x_1$ with $f(x_1) = 1$. Thus the first inequality in rule (1) with $a = \hat{w} \cdot \hat{x}_1 > 0$ must hold. Let $b$ in rule (1) be the sum of weights associated with some vector $x_2$ with $f(x_2) = 1$. Thus the second inequality in rule (1) with $b = \hat{w} \cdot \hat{x}_2 > 0$ must hold. According to rule (1) we must have $\hat{w} \cdot \hat{x}_1 + \hat{w} \cdot \hat{x}_2 > 0$. Assume we have for the zero vector $0$: $f(0) = 0$. This implies $w_0 \le 0$ and thus $(-w_0) + \hat{w} \cdot \hat{x}_1 + \hat{w} \cdot \hat{x}_2 > 0$. Let $x_1$ and $x_2$ be vectors with no 1s in the same position; we write $x_1 \cap x_2 = 0$. Let $z$ be a vector obtained from the two input vectors $x_1$ and $x_2$ such that $z_i = 1$ if $x_{1i} = 1$ or $x_{2i} = 1$, otherwise $z_i = 0$; we write $z = x_1 \cup x_2$. Vector $z$ has a corresponding weighted sum $\hat{w} \cdot \hat{z} = (-w_0) + \hat{w} \cdot \hat{x}_1 + \hat{w} \cdot \hat{x}_2$. Because $(-w_0) + \hat{w} \cdot \hat{x}_1 + \hat{w} \cdot \hat{x}_2 > 0$ we must have $f(z) = 1$. The same kind of reasoning holds if $f(x_1) = 0$, $f(x_2) = 0$ and $f(0) = 1$. In that case $f(z) = 0$ must hold. Thus we can write the consistency property of the binary linear threshold function: If the logical function $y = f(x)$ is a binary linear threshold function
  17. 17. with respect to $x_1, x_2, \ldots, x_n$ and $f(x_1) = u$, $f(x_2) = u$, $f(0) = \bar{u}$, then $f(z) = u$ must hold, with $z = x_1 \cup x_2$ and $x_1 \cap x_2 = 0$, and $u = 1$ ($\bar{u} = 0$) or $u = 0$ ($\bar{u} = 1$).

Example 2.4
For the identity function we have Table 2.6.

Table 2.6
x1  x2  y
0   0   1
0   1   0
1   0   0
1   1   1

From Table 2.6 we find:

$f(0, 0) = 1$ and thus $w_0 > 0$
$f(0, 1) = 0$ and thus $w_0 + w_2 \le 0$
$f(1, 0) = 0$ and thus $w_0 + w_1 \le 0$

For $f$ to be a linear threshold function we must have $f(1, 1) = 0$, which implies $w_0 + w_1 + w_2 \le 0$. However, $f(1, 1) = 1$, and thus $f$ cannot be a linear threshold function with respect to $x_1$ and $x_2$.

Many logical functions of $n$ arguments cannot be realized by a single-neuron Perceptron. This also becomes clear when we consider the determination of a logical function as a classification problem. We have two classes of points in an $n$-dimensional input space (see Figure 2.10) for the parity problem as presented in Example 2.3. For one class of points $A = \{(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)\}$ the number of 1s is odd and the output of the neuron must be 1, and for the other class $B = \{(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)\}$ the number of 1s is even and the output must be 0. Thus for the first class we must have $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 > 0$ and for the second class we must have $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 \le 0$. The set of points $(x_1, x_2, x_3)$ for which $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 = 0$ represents a two-dimensional separating plane $H$ in the three-dimensional input space. On one side of this plane $H$ we must have for every point $(x_1, x_2, x_3)$ that $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 > 0$ and thus the output of the neuron will become equal to 1. On the other side of the plane $H$ we want to have $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 < 0$ and the output of the neuron must be 0. Now one can easily check that we cannot locate a plane between the two sets of points that must be separated;
thus there exists no solution for our classification problem and thus the parity function cannot be represented by a linear threshold function. Or equivalently one can say: both sets of points are not linearly separable.

In the $n$-dimensional case we must have an $(n-1)$-dimensional separating hyperplane $H$ in the $n$-dimensional input space $X$ if the logical function is a linear threshold function with respect to $x_1, x_2, \ldots, x_n$. The separating hyperplane is defined by $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = -w_0$. The weight vector $w = [w_1, w_2, \ldots, w_n]$ is orthogonal to the separating hyperplane. This becomes clear when we take two $n$-dimensional input vectors $x_1$ and $x_2$ located on the hyperplane. Figure 2.11 shows the two-dimensional case. For these vectors $x_1$ and $x_2$ we have that $w^t(x_1 - x_2) = 0$ and thus $w$ and the hyperplane are orthogonal.

Figure 2.11 One-dimensional separating hyperplane H in a two-dimensional space

The separating hyperplane $H$ divides the $n$-dimensional input space $X$ into two half-spaces, the region $X^+$ where $\sum w_i x_i > -w_0$ (i.e. $y = 1$) and the region $X^-$ where
  18. 18. $\sum w_i x_i \le -w_0$ (i.e. $y = 0$). Since a vector $x$ in the region $X^+$ will give $w^t x > -w_0$, the weight vector $w$ points into the region $X^+$. It is said that $x$ is on the positive side of the hyperplane $H$ if $x$ is in $X^+$ and $x$ is on the negative side of $H$ if $x$ is in $X^-$.

The distance from the origin to the hyperplane $H$ is equal to the projection of a vector $x$ in $H$ (i.e. $w^t x = -w_0$) on the unit vector $w/|w|$ normal to $H$. Thus $d = w^t x/|w|$, or with $w^t x = -w_0$ we find for the distance along $w$ from the origin to the hyperplane:

$$d = \frac{-w_0}{|w|}$$

The distance from the hyperplane $H$ to an input vector $x$ is proportional to $w^t x + w_0$. The easiest way to see this is by decomposing the vector $x$ into a component $x_w$ in the direction of $w$ and another component $x_n$ orthogonal to $w$ (see Figure 2.12):

$$x = x_w + x_n$$

By forming the inner product $w^t x$ and noting that $w^t x_n = 0$ we obtain for the length of the component of $x$ along $w$:

$$|x_w| = \frac{w^t x}{|w|}$$

By subtraction of the distance $d$ from the origin to the hyperplane we obtain for the distance $\delta$ from the hyperplane to $x$:

$$\delta = \frac{w^t x + w_0}{|w|}$$

Figure 2.12 Decomposition of the input vector x

We finally mention that one can prove (see Cover, 1965) that in the $n$-dimensional case the number of linear threshold functions is at most:

$$2 \sum_{i=0}^{n} \binom{2^n - 1}{i}$$

(The count is exact when the sample points are in general position; for $n = 2$ it gives the fourteen functions mentioned above.) For partial functions, $2^n$ must be replaced by the number of samples.

2.3 Equivalent linear threshold functions

This section on equivalent linear threshold functions can be omitted on a first reading of this book. For a more profound understanding of the learning capabilities of a binary Perceptron, however, one must know that in general a binary Perceptron yields an infinite number of solutions to the same problem.

In the next chapter we will give a learning procedure of how to find, from a set of samples of a logical function, a single-neuron binary Perceptron that will realize that logical function,
if the logical function is a linear threshold function with respect to the variables $x_1, x_2, \ldots, x_n$. It will turn out that the weights of the linear threshold function, obtained after learning by the neuron, will depend on the particular sequence of supplied samples. Different threshold functions can, however, realize the same logical function; these are known as equivalent linear threshold functions.

In addition, it is worth determining which features of linear threshold functions are essential and which are arbitrary. It frequently occurs that the descriptions of linear threshold functions are different, i.e. the weights $w_i$ are different, whereas the realized logical function is the same. We will now give four different simple theorems on equivalent linear threshold functions. The first theorem is important, though its proof is trivial.

Theorem 2.1
If $y = S(\sum w_i x_i + w_0)$ is a linear threshold function, then $y' = S(\sum \lambda w_i x_i + \lambda w_0)$ with $\lambda > 0$ is an equivalent linear threshold function.

Example 2.5
Let $y = S(x_1 + x_2 - x_3)$ and thus $w_1 = 1$, $w_2 = 1$ and $w_3 = -1$; then with $\lambda = 2$ the linear threshold function $y' = S(2x_1 + 2x_2 - 2x_3)$ is equivalent to $y$.

Theorem 2.2a
If $y = S(\sum w_i x_i + w_0)$ is a linear threshold function and $y' = \sum w_i' x_i + w_0'$ is a linear function with respect to the binary variables $x_1, x_2, \ldots, x_n$ such that if $y' > 0$
  19. 19. then $y = 1$, and if $y' = 0$ then $y = 1$ or $y = 0$, and if $y' < 0$ then $y = 0$, then $y'' = S(\sum (w_i + \lambda w_i') x_i + w_0 + \lambda w_0')$ with $\lambda \ge 0$ is equivalent to $y = S(\sum w_i x_i + w_0)$.

Proof
Because we add to the value of the argument of the step function in $y$ a positive or zero value when $y = 1$ and we add to the value of the argument of the step function in $y$ a negative or zero value when $y = 0$, it will be clear that $y = y''$ for all the values of the argument. QED

Example 2.6
Let $y = S(x_1 + x_2 - x_3)$ as specified in Table 2.7. Let $y' = x_1 + x_2 - 1$ as specified in column 5 of Table 2.7. The conditions on $y'$ are satisfied so the linear threshold function $y'' = S(3x_1 + 3x_2 - x_3 - 2)$ (with $\lambda = 2$) is equivalent to $y$.

Table 2.7
x1  x2  x3  y   y'   y''
0   0   0   0   -1   0
0   0   1   0   -1   0
0   1   0   1   0    1
0   1   1   0   0    0
1   0   0   1   0    1
1   0   1   0   0    0
1   1   0   1   1    1
1   1   1   1   1    1

The trivial complement of the previous theorem is as follows:

Theorem 2.2b
If $y = S(\sum w_i x_i + w_0)$ is a linear threshold function and $y' = \sum w_i' x_i + w_0'$ is a linear function with respect to the binary variables $x_1, x_2, \ldots, x_n$ such that if $y' < 0$ then $y = 1$, and if $y' = 0$ then $y = 1$ or $y = 0$, and if $y' > 0$ then $y = 0$, then $y'' = S(\sum (w_i + \lambda w_i') x_i + w_0 + \lambda w_0')$ with $\lambda \le 0$ is equivalent to $y = S(\sum w_i x_i + w_0)$.

For the next theorem we need some auxiliary concepts and lemmas. We first define $X^+$ as the set of argument values of a logical function for which $y = 1$; in the same way we have $X^-$ as the set of argument values for which $y = 0$.

Lemma 2.1
If $y = S(\sum w_i x_i + w_0)$ is a linear threshold function, then $y' = S(\sum w_i x_i + w_0 - \Delta^+)$ with $0 \le \Delta^+ < \min(\sum w_i x_i + w_0)$ over $X^+$ is an equivalent linear threshold function. If $y'' = S(\sum w_i x_i + w_0 - \Delta^-)$ with $0 \ge \Delta^- \ge \max(\sum w_i x_i + w_0)$ over $X^-$, then $y''$ is also an equivalent linear threshold function.
Proof
When we subtract from the argument of the step function in $y$ a positive constant $\Delta^+$ less than $\min(\sum w_i x_i + w_0)$ over $X^+$, then the argument of the step function will remain positive for all elements of $X^+$ and the argument of the step function will be negative or zero for all elements of $X^-$; thus $y$ will remain the same for all elements. If we add to the argument a positive constant value smaller than or equal to $-\max(\sum w_i x_i + w_0)$ over $X^-$, then $y$ will also remain the same for all values of the argument. QED

Example 2.7
The linear threshold function $y = S(x_1 + x_2 - x_3)$ is equivalent to $y' = S(x_1 + x_2 - x_3 - 0.5)$ because $\Delta^+ = 0.5$ and $0 \le 0.5 < \min(\sum w_i x_i + w_0) = 1$ over $X^+$ (see Table 2.7). Note: $\Delta^- = 0$.

Lemma 2.2
If $y = S(\sum w_i x_i + w_0)$ is a linear threshold function w.r.t. $x_1, x_2, \ldots, x_n$, then $y' = S(\sum (w_i - \delta_i) x_i + w_0)$ is an equivalent linear threshold function if $0 \le \delta_i < (1/n)\min(\sum w_i x_i + w_0)$ over $X^+$ or if $0 \ge \delta_i \ge (1/n)\max(\sum w_i x_i + w_0)$ over $X^-$.

Proof
If, in the worst case, for all $x_i$ we have $x_i = 1$, we subtract in the first case from the argument of the step function in $y$ the constant $n\delta_i$ with $0 \le n\delta_i < \min(\sum w_i x_i + w_0)$ over $X^+$. According to Lemma 2.1, we obtain in that case an equivalent linear threshold function. The same holds if $0 \ge \delta_i \ge (1/n)\max(\sum w_i x_i + w_0)$ over $X^-$. QED

Example 2.8
The linear threshold function $y = S(x_1 + x_2 - x_3)$ is equivalent to $y' = S[(1 - 1/6)x_1 + (1 - 1/6)x_2 - (1 + 1/6)x_3]$ because $\delta_i = 1/6$ and $0 \le 1/6 < (1/3)\min(\sum w_i x_i + w_0)$ over $X^+$.

Theorem 2.3
If $y = S(\sum w_i x_i + w_0)$ is a linear threshold function, then there exists an equivalent linear threshold function $y' = S(\sum w_i' x_i + w_0')$ such that all weights are integers.
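Theorem 2.3 and the equivalences claimed in Examples 2.5 to 2.8 can be checked by enumerating all eight arguments. The sketch below uses exact fractions to avoid rounding issues. The helper names `threshold`, `equivalent` and `integer_weights` are illustrative assumptions, and the weights for Example 2.8 follow Lemma 2.2, i.e. $w_i - 1/6$ for every input.

```python
from fractions import Fraction
from itertools import product


def threshold(w0, w):
    """Return x -> S(w0 + sum(w_i * x_i)), with S(z) = 1 if z > 0 and 0 otherwise."""
    return lambda x: int(w0 + sum(wi * xi for wi, xi in zip(w, x)) > 0)


def equivalent(f, g, n=3):
    """True if f and g agree on all 2**n binary arguments."""
    return all(f(x) == g(x) for x in product((0, 1), repeat=n))


def integer_weights(w0, w):
    """Scale rational weights by the product of their denominators (a positive factor)."""
    d = 1
    for value in (w0, *w):
        d *= Fraction(value).denominator
    return int(w0 * d), tuple(int(wi * d) for wi in w)


sixth = Fraction(1, 6)
y = threshold(0, (1, 1, -1))  # y = S(x1 + x2 - x3)

candidates = {
    "Example 2.5": (0, (2, 2, -2)),
    "Example 2.6": (-2, (3, 3, -1)),
    "Example 2.7": (Fraction(-1, 2), (1, 1, -1)),
    "Example 2.8": (0, (1 - sixth, 1 - sixth, -1 - sixth)),
}

for name, (w0, w) in candidates.items():
    int_w0, int_w = integer_weights(w0, w)
    print(name,
          equivalent(y, threshold(w0, w)),          # claimed equivalence
          (int_w0, int_w),                          # integer version of the weights
          equivalent(y, threshold(int_w0, int_w)))  # still equivalent after scaling
```

Multiplying by the product of the denominators is a positive scaling, so by Theorem 2.1 the integer version realizes the same logical function; a smaller common multiple would work equally well.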
  20. 20. Proof
If all weights in $y$ are rational numbers, we can form the product $D$ of all denominators of the weights and multiply all weights by $D$. All weights will become integers and the obtained threshold function $y'$ is, according to Theorem 2.1, equivalent to $y$. If a weight $w_i$ is a real number, we can replace weight $w_i$, according to Lemma 2.2, by a rational weight $w_i'$ such that $w_i' = w_i - \delta_i$, and rational weights can be replaced by integer weights. This completes our proof. QED

2.4 Learning a single-neuron binary Perceptron with the reinforcement rule

Although a single-neuron binary Perceptron is not of great practical use, because only a few of the logical functions are linear threshold functions with respect to the variables $x_1, x_2, \ldots, x_n$, we can nevertheless gain much understanding of more complicated networks by investigating the learning behaviour of one building unit.

In Section 2.2 we demonstrated how to determine the weights of a single-neuron binary Perceptron from a set of inequalities. Now we investigate how we adapt step by step in a learning process the weights of a neuron in order to identify some logical function. In fact we can use the learning process as an algorithm to solve a set of inequalities. We assume that the function to be realized by the Perceptron is a linear threshold function.

At a given step of the learning process we have some extended weight vector $\hat{w} = [w_0, w_1, \ldots, w_n]$; the output will be correct for a subset of all arguments of the function to be identified, and for the remaining arguments the output will be wrong. The set of arguments for which the target output is equal to 1, whereas the actual output is equal to 0, will be denoted by $T_w^+$; the set of arguments for which the target output is equal to 0 and the actual output is 1 will be denoted by $T_w^-$. The arguments $x_1, x_2, \ldots, x_n$ of the function $y$ will be extended with the constant internal input $x_0 = 1$ of the neuron.
It will turn out that we have to change the weights proportional to the elements of $T_w^+$ and negative proportional to the elements of $T_w^-$.

Example 2.9
The function $y$ to be realized is specified in Table 2.8. Note that in Table 2.8 the extended inputs $[x_0, x_1, x_2]$ with $x_0 = 1$ are given, whereas $y$ is a logical function of $x_1$ and $x_2$.

Table 2.8
x0  x1  x2  y
1   0   0   0
1   0   1   0
1   1   0   0
1   1   1   1

The neuron in the initial learning state (i.e. at step $k = 0$) has the weight vector $\hat{w}(0) = [w_0(0), w_1(0), w_2(0)]^t = [0.5, 1, -1]^t$ (see Figure 2.13). For the output $y(0)$ of the neuron at step $k = 0$ we have (see Table 2.9) the following:

$$y(0) = S[w_0(0) + w_1(0)x_1 + w_2(0)x_2] = S(0.5 + x_1 - x_2)$$

Figure 2.13 Single neuron with initial weights

We see that for the extended input vectors $[1, 0, 0]^t$ and $[1, 1, 0]^t$ the output is wrong. The only way to improve the output for the vector $[1, 0, 0]^t$ is by decreasing the value of $w_0$. So we can change the weight vector $\hat{w} = [w_0, w_1, w_2]^t$ by an amount $\Delta\hat{w}$ proportional to $-[1, 0, 0]^t$, i.e. minus the first extended input vector. To improve the output for the vector $[1, 1, 0]^t$ we have to decrease the values of $w_0$ and $w_1$. So we can change the weight vector by an amount $\Delta\hat{w}$ proportional to $-[1, 1, 0]^t$, i.e. minus the third extended input vector. We can take the proportionality equal to 1 and add both increments $\Delta\hat{w}$. Thus we obtain $\Delta\hat{w} = [-2, -1, 0]^t$. The new weight vector becomes $\hat{w}(1) = [w_0(1), w_1(1), w_2(1)]^t = [-1.5, 0, -1]^t$. Because:

$$y(1) = S[w_0(1) + w_1(1)x_1 + w_2(1)x_2] = S(-1.5 - x_2)$$

we observe that now we obtain the wrong output only for the input vector $[1, 1, 1]^t$. To improve for that argument the output of the neuron we have to increase $w_0$, $w_1$ and $w_2$. Thus we can take $\Delta\hat{w}$ proportional to the corresponding misclassified vector $[1, 1, 1]^t$.
We can proceed in the same way as above and after a finite number of steps we will obtain a correct response of the single-neuron Perceptron, if the original function is a linear threshold function with respect to $x_1, x_2, \ldots, x_n$. In Table 2.9 the results for six learning steps are given. The final linear threshold function is equal to:

$$y = S(-1.5 + x_1 + x_2)$$

Let in general $X^+$ be the set of argument values with target output $y = 1$ and $X^-$ the set of argument values with target output $y = 0$. For a given value of the weight vector $\hat{w}$, a subset $T_w^+$ of $X^+$ and a subset $T_w^-$ of $X^-$ are misclassified.
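Example 2.9 can be replayed with a short sketch. At every step it collects the misclassified sets, adds all elements of $T_w^+$ and subtracts all elements of $T_w^-$, with the proportionality constant set to 1. The function names `output`, `misclassified` and `global_learning`, and the parameters `rate` and `max_steps`, are illustrative assumptions rather than notation from the text.

```python
def step(z):
    """Step function S(z): 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0


def output(w, x):
    """Neuron output for extended weight vector w and extended input x (x[0] == 1)."""
    return step(sum(wi * xi for wi, xi in zip(w, x)))


def misclassified(w, samples):
    """Split wrongly classified inputs into T+ (target 1) and T- (target 0)."""
    t_plus = [x for x, y in samples if y == 1 and output(w, x) == 0]
    t_minus = [x for x, y in samples if y == 0 and output(w, x) == 1]
    return t_plus, t_minus


def global_learning(w, samples, rate=1.0, max_steps=100):
    """Repeat w <- w + rate * (sum of T+ minus sum of T-) until no errors remain.

    Returns the list of weight vectors visited, starting with the initial one.
    The loop gives up after max_steps updates if no solution has been found.
    """
    history = [tuple(w)]
    for _ in range(max_steps):
        t_plus, t_minus = misclassified(w, samples)
        if not t_plus and not t_minus:
            break
        delta = [rate * (sum(x[i] for x in t_plus) - sum(x[i] for x in t_minus))
                 for i in range(len(w))]
        w = [wi + di for wi, di in zip(w, delta)]
        history.append(tuple(w))
    return history


# Table 2.8: extended inputs [x0, x1, x2] with their target outputs.
SAMPLES = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]

history = global_learning([0.5, 1.0, -1.0], SAMPLES)
for k, w in enumerate(history):
    print(k, w)
```

Starting from $[0.5, 1, -1]$ the sketch reaches $[-1.5, 1, 1]$ after six updates, which corresponds to the final function $y = S(-1.5 + x_1 + x_2)$ stated above.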
  21. 21. Table 2.9
Step       w0     w1    w2    T+         T-
0    w     0.5    1     -1    -          [1,0,0], [1,1,0]
     Δw    -2     -1    0
1    w     -1.5   0     -1    [1,1,1]    -
     Δw    1      1     1
2    w     -0.5   1     0     -          [1,1,0]
     Δw    -1     -1    0
3    w     -1.5   0     0     [1,1,1]    -
     Δw    1      1     1
4    w     -0.5   1     1     -          [1,1,0], [1,0,1]
     Δw    -2     -1    -1
5    w     -2.5   0     0     [1,1,1]    -
     Δw    1      1     1
6    w     -1.5   1     1     -          -

We will show that for a convergent learning process we can modify the weight vector $\hat{w}$ by adding $\sum \varepsilon \hat{x}_i$, summed over $\hat{x}_i \in T_w^+$, where $\varepsilon > 0$ is a proportionality constant, and subtracting from $\hat{w}$ the value of $\sum \varepsilon \hat{x}_j$ for $\hat{x}_j \in T_w^-$. Because at each step we use the total set $T_w^+ \cup T_w^-$ for modifying the weight vector we will call this way of learning global learning. However, we can also modify at each step the weight vector with only one element of $T_w^+ \cup T_w^-$. This way of learning will be called local learning. The learning rule is called reinforcement learning for both cases. The proportionality constant $\varepsilon$ is called the learning rate.

We see that the output of a single-neuron binary Perceptron at a certain step $k$ depends on the actual input and on the weight vector $\hat{w}(k)$ at step $k$. We can consider the value of the weight vector at step $k$ as the state of the learning system. This enables us to describe the single-neuron binary Perceptron within the framework of the theory of finite sequential machines. A finite sequential machine is described by a state transition function $\delta$ and an output function $\lambda$. Given the state $\hat{w}(k)$ and the input $\hat{x}$, the next state is defined by $\hat{w}(k+1) = \delta(\hat{w}(k), \hat{x})$ and the output at step $k$ is defined by $y(k) = \lambda(\hat{w}(k), \hat{x})$. The behaviour of a sequential machine can be graphically represented by a so-called state diagram. In a state diagram, states are represented by circles and transitions between states by arrows pointing from the actual state to the next state. Each arrow has a label indicating the supplied input and a label corresponding to the output. We can define the output function $\lambda$
of the sequential machine corresponding to a single-neuron Perceptron as:

$$\lambda(\hat{w}, \hat{x}) = S(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n)$$

In case of local learning and with the learning rate $\varepsilon$ equal to 1, we can define the state transition function $\delta$ as:

$$\delta(\hat{w}, \hat{x}) = \begin{cases} \hat{w} & \text{if } \lambda(\hat{w}, \hat{x}) \text{ is correct} \\ \hat{w} + \hat{x} & \text{if } \hat{x} \in T_w^+ \\ \hat{w} - \hat{x} & \text{if } \hat{x} \in T_w^- \end{cases}$$

Example 2.10
The function $y$ to be realized is specified in Table 2.10. We use the extended input, i.e. with the constant $x_0 = 1$. The function $y$ to be identified is a logical function of $x_1$ and $x_2$. The different input values will be denoted by $p_0$, $p_1$, $p_2$ and $p_3$.

Table 2.10
      x0  x1  x2  y
p0    1   0   0   0
p1    1   0   1   0
p2    1   1   0   0
p3    1   1   1   1

The neuron in the initial learning state (i.e. at step $k = 0$) has the weight vector $\hat{w}(0) = [w_0(0), w_1(0), w_2(0)]^t = [1, 1, 1]^t$. The output is defined by:

$$y(0) = S(w_0(0) + w_1(0)x_1 + w_2(0)x_2) = S(1 + x_1 + x_2)$$

Only for input $p_3$ is the output correct, and so we remain in the same state only for that input. For input $p_0 \in T_w^-$ the output is 1 and the next state is $\hat{w}(0) - p_0 = [0, 1, 1]^t$. For input $p_1 \in T_w^-$ the output is 1 and the next state is $\hat{w}(0) - p_1 = [0, 1, 0]^t$. For input $p_2 \in T_w^-$ the output is 1 and the next state is $\hat{w}(0) - p_2 = [0, 0, 1]^t$ (see Figure 2.14). We can continue this process in the same way, with the final result as shown in Figure 2.15.

In order to find a weight vector for a correct realization of a logical function, we found intuitively that in the learning phase the weight vector must be increased or decreased with vectors proportional to the misclassified extended input vectors. We can see the same in a more formal way. In the case of local learning the adaptation at step $k$ becomes:

$$\hat{w}(k+1) = \hat{w}(k) + \varepsilon\hat{x}_i \text{ if } \hat{x}_i \in T_w^+ \quad \text{and} \quad \hat{w}(k+1) = \hat{w}(k) - \varepsilon\hat{x}_j \text{ if } \hat{x}_j \in T_w^-$$

In the case $\hat{x}_i \in T_w^+$ we have prior to adaptation that the inner product $\hat{w}(k) \cdot \hat{x}_i \le 0$. After adaptation we have $\hat{w}(k+1) \cdot \hat{x}_i = \hat{w}(k) \cdot \hat{x}_i + \varepsilon|\hat{x}_i|^2$. So we add a positive number
  22. 22. to the old inner product, and the inner product is changed into the desired direction. In the case $\hat{x}_j\in T^{-}$ we have prior to adaptation that the inner product $\hat{w}(k)\cdot\hat{x}_j>0$. After adaptation we have $\hat{w}(k+1)\cdot\hat{x}_j=\hat{w}(k)\cdot\hat{x}_j-\varepsilon|\hat{x}_j|^2$. So the inner product is changing into the desired direction.

In cases of global learning the adaptation at step $k$ becomes:

$\hat{w}(k+1)=\hat{w}(k)+\varepsilon\left(\sum\hat{x}_i-\sum\hat{x}_j\right)$ with $\hat{x}_i\in T^{+}$ and $\hat{x}_j\in T^{-}$

We can conceive $\sum\hat{x}_i-\sum\hat{x}_j$ as one correction vector $\hat{c}$ built from the misclassified vectors. We have prior to adaptation that the inner product $\hat{w}(k)\cdot\hat{c}\le 0$. After adaptation we have $\hat{w}(k+1)\cdot\hat{c}=\hat{w}(k)\cdot\hat{c}+\varepsilon|\hat{c}|^2$. Thus in the case of global learning the inner product is also changing into the desired direction.

Figure 2.14 The initial state diagram of Example 2.10

Figure 2.15 The complete state diagram of Example 2.10

We see that at each step the behaviour of the single-neuron binary Perceptron is improved with respect to the misclassified input vectors at that step. However, in general after an adaptation step the set of misclassified vectors is changed, so we have to adapt the weight vector for that new set of misclassified vectors.

One may wonder whether we can always be sure that we will finally arrive at a state where the outputs are correct. It might be supposed, for instance, that we may never finish updating the weight vector because the process can enter a loop of states. However, we will show that when at a certain step a state (i.e. a weight vector) $\hat{w}(k)$ is obtained, we will never return to the same state if there exists a solution.

Theorem 2.4
During the learning process a single-neuron binary Perceptron will never enter the same state more than once if there exists a solution space.
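Theorem 2.4 can be illustrated numerically before turning to its proof. The sketch below runs global learning with $\varepsilon=1$ and records every visited state. It assumes that the function being learned is the logical AND of $x_1$ and $x_2$ and that the start state is $(0.5, 1.5, -1)$; the names `global_learning` and `AND_SAMPLES` are illustrative only.

```python
def step(z):
    """Step function S: 1 for z > 0, otherwise 0."""
    return 1 if z > 0 else 0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Extended inputs [x0, x1, x2] with x0 = 1, paired with the target output.
# Assumption: the target function is the logical AND.
AND_SAMPLES = [
    ([1, 0, 0], 0),
    ([1, 1, 0], 0),
    ([1, 0, 1], 0),
    ([1, 1, 1], 1),
]


def global_learning(w, samples, eps=1.0, max_steps=100):
    """Apply the global rule until no input is misclassified.

    Returns the list of visited states (weight vectors), start state included.
    """
    w = list(w)
    states = [tuple(w)]
    for _ in range(max_steps):
        wrong = [(x, t) for x, t in samples if step(dot(w, x)) != t]
        if not wrong:
            return states
        # Add the vectors of T+ (target 1) and subtract those of T- (target 0).
        for x, t in wrong:
            sign = 1 if t == 1 else -1
            w = [wi + eps * sign * xi for wi, xi in zip(w, x)]
        states.append(tuple(w))
    raise RuntimeError("no solution reached within max_steps")


states = global_learning([0.5, 1.5, -1.0], AND_SAMPLES)
```

Under these assumptions every recorded state is distinct, which is what the theorem predicts whenever a solution exists.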
Proof
Let $\hat{w}(k+n)$ be a state reached from $\hat{w}(k)$ after $n$ additional adaptations: we then have a sum of vectors added to $\hat{w}(k)$:

$\varepsilon\left(\sum\hat{x}_i-\sum\hat{x}_j\right)$ with $\hat{x}_i\in T^{+}(k+p)$ and $\hat{x}_j\in T^{-}(k+p)$, with $p=0,1,\ldots,n-1$

Let $\hat{s}$ (a solution vector) be a weight vector for which a correct solution is obtained. We consider the inner product of $\hat{s}$ and $\hat{w}(k+n)$:

$\hat{s}\cdot\hat{w}(k+n)=\hat{s}\cdot\hat{w}(k)+\varepsilon\left[\sum\hat{s}\cdot\hat{x}_i-\sum\hat{s}\cdot\hat{x}_j\right]$

summed over $\hat{x}_i\in T^{+}(k+p)$ and $\hat{x}_j\in T^{-}(k+p)$ with $p=0,\ldots,n-1$. For every $\hat{x}_i\in T^{+}(k+p)$ we have $\hat{s}\cdot\hat{x}_i>0$ and for every $\hat{x}_j\in T^{-}(k+p)$ we have $\hat{s}\cdot\hat{x}_j\le 0$. Thus $\hat{s}\cdot\hat{w}(k+n)\ne\hat{s}\cdot\hat{w}(k)$ and hence $\hat{w}(k+n)\ne\hat{w}(k)$. (It may happen that $T^{+}(k+p)$ is empty for all $p$ and that $\hat{s}\cdot\hat{x}_j=0$, thus in that case $\hat{s}\cdot\hat{w}(k+n)=\hat{s}\cdot\hat{w}(k)$. However, in that case we can take another solution vector $\hat{s}'$ from the solution space such that $\hat{s}'\cdot\hat{x}_j\ne 0$.) QED

Although we will never return to the same state during learning, this does not guarantee that the learning process will stop, because the number of states in the space containing incorrect states is infinite. However, the quotient of the number of correct states and the number of incorrect states is finite.

The subspace $\hat{S}$ of the extended weight space $\hat{W}$ containing all correct weight vectors is called the solution space. The solution space also contains an infinite number of weight vectors, as becomes clear from the following arguments. If $\hat{w}$ is a solution vector, then the inner product $\hat{w}\cdot\hat{x}>0$ for all $\hat{x}\in\hat{X}^{+}$ and $\hat{w}\cdot\hat{x}<0$ for all $\hat{x}\in\hat{X}^{-}$. The same holds for $\beta\hat{w}$ with $\beta>0$, thus $\beta\hat{w}$ is also a solution vector. If $\hat{w}_i$ and $\hat{w}_j$ are solution vectors, then one easily verifies that $\beta_i\hat{w}_i+\beta_j\hat{w}_j$ with $\beta_i$ and
  23. 23. $\beta_j>0$ is also a solution vector. A space having these properties is called a convex cone. In Figure 2.16 we have given the solution space of Example 2.9.

Figure 2.16 The solution space of Example 2.9

The quotient of the 'volume' of the solution space $\hat{S}$ and the volume of its complement $\hat{W}-\hat{S}$ is finite; thus if we could select weight vectors randomly we would have a finite probability of selecting a correct solution vector. In the learning process we do better because we never select a weight vector that has been chosen previously (Theorem 2.4), and moreover after each adaptation we move with the new weight vector in the direction of the solution space, as should become clear from the next example.

In our previous discussion it should have been noticeable that we did not build our theory explicitly on binary valued input vectors. It turns out that linear threshold functions can also be functions of real-valued input vectors. In the next example we will use real-valued inputs.

Example 2.11
Assume we want to learn a simple threshold function with one-dimensional real-valued input vectors such that the output equals 1 for $x_1=(1)$ and for $x_2=(1.5)$, whereas the output must be 0 for the inputs $x_3=(0.25)$ and $x_4=(-0.5)$. In Figure 2.17 we have given for the extended input space the solution space for $\hat{X}^{+}=\{\hat{x}_1,\hat{x}_2\}=\{[1,1]^t,[1,1.5]^t\}$ and $\hat{X}^{-}=\{\hat{x}_3,\hat{x}_4\}=\{[1,0.25]^t,[1,-0.5]^t\}$. Because the output for $\hat{x}_1$ must be 1, we must have for the solution weight vector $\hat{s}$: $\hat{s}\cdot\hat{x}_1>0$, and thus $\hat{s}$ must be located to the right of line $l_1$. For input $\hat{x}_3$ the output must be 0 and thus $\hat{s}\cdot\hat{x}_3\le 0$, and thus $\hat{s}$ must be located to the left of the line $l_3$ or on the line $l_3$. We can do the same for the other inputs. The intersection of the separate solution spaces gives the solution space, indicated by the shaded area in Figure 2.17.
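The four inequalities of Example 2.11 can be checked mechanically. The following sketch tests whether a candidate weight vector lies in the solution space and also spot-checks the convex-cone property; the function name `in_solution_space` and the candidate vectors are illustrative choices, not taken from the text.

```python
# Extended inputs [x0, x1] of Example 2.11, with x0 = 1.
X_PLUS = [(1.0, 1.0), (1.0, 1.5)]     # target output 1
X_MINUS = [(1.0, 0.25), (1.0, -0.5)]  # target output 0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def in_solution_space(s):
    """True if s.x > 0 for every x in X+ and s.x <= 0 for every x in X-."""
    return (all(dot(s, x) > 0 for x in X_PLUS)
            and all(dot(s, x) <= 0 for x in X_MINUS))


def combine(beta_a, a, beta_b, b):
    """Positive linear combination beta_a * a + beta_b * b."""
    return tuple(beta_a * ai + beta_b * bi for ai, bi in zip(a, b))


a = (-1.0, 1.75)   # hypothetical candidate
b = (-1.0, 2.0)    # hypothetical candidate
print(in_solution_space(a), in_solution_space(b))
print(in_solution_space(combine(2.0, a, 3.0, b)))
```

A vector such as $(-1, 1)$ fails the test because its inner product with $[1,1]^t$ is exactly 0, while the output for that input must be 1.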
Figure 2.17 Construction of the solution space of Example 2.11

Figure 2.18 The construction of the sequence of weight vectors of Example 2.11

Let the initial weights be $[w_0(0),w_1(0)]=(2,1)$, then the initial separating hyperplane $H(0)$ is defined by $2x_0+x_1=0$ (see Figure 2.18). With this initial hyperplane the extended inputs $\hat{x}_3$ and $\hat{x}_4$ give the wrong output. If we use the procedure for global learning we have to subtract from $\hat{w}(0)$ the vector $\varepsilon(\hat{x}_3+\hat{x}_4)$. With $\varepsilon=1$ the new weight vector becomes $\hat{w}(1)=(2,1)-(2,-0.25)=(0,1.25)$. We see that the weight vector is changed in the direction of the solution space. The corresponding separating
  24. 24. 32 The binary Perceplron hyperplane is now identical with the xo-axis. and only for the input vector )2, = ( l , 0.25) do we obtain a wrong output because w, ,(l)'0+iv, (l). ', =0+(| t25)(0.25)>Oand hence , 5‘: l. Now we subtract from Mt) the input vector )‘t_, =(I,0.25). The next weight vector becomes v‘v(2) = ( l. I) . ’I|1('l1'C1l)0I‘(. lCl' ofthe solution 5 ILC. The output for the input vector 92, is now just wrong: } =0. Thus in the next step we add it. =(l. I) to the weight vector and we obtain ts 1 =10, 2). lnpiit vector )2, is now niiscliissificd, so we have to subtract from v”v_, the vector >‘. 'J= (l, 0.25]: the next weight vector v‘vl4t= (< l. l.75) is in the solution space. | The question arises of whether we will always enter. in ti finite number ofaduptation steps. the solution space. One can prove that for any constant learning riite i: [called / i,'t'iI in -uniml Imi'iiiii_i; l this will be the case — even for El tinie~varying learning rule like i: [A = l/ k or 1:(I: ) . These stutenients are consequences of the Pt'7't't’[lIl‘I| ll (‘IlIll't’l‘_t[L’llt'L’ ilit-nriuii. which will he disciissed in the next section. 2.5 The Perceptron convergence theorem This section mainly deiils with the formal statement of the Perccptroii convergence theorem and its proof. Because we have iilrciidy outlined the theorem in the previous section in an informal way. this section can be omitted on a lirst reiiding of the hook without loss of continuity. The Perceptron convergence theorem concerns the convergence of the letiriiiiig procedure to find. from . ‘amp| e. of correct behaviour. the linear threshold function y= Sl$'0+W| ,"+‘»X1++‘"X, .l realized by a single binary Perceptron, if the function_ f(', , x, .. ,x, ,) to he itlentitied is ti linear threshold function (the symbol S represents the step function). ln the previous section the variables y and . ', - were biliary valued; the Perceptniii convergence theorem is. 
however, also applicable if the variables $x_i$ are real valued.

The reinforcement learning rule is given by:

Let $\hat{w}(0)=(w_0(0),w_1(0),\ldots,w_n(0))$ be any initial weight vector.
Let $\hat{w}(k)=(w_0(k),w_1(k),\ldots,w_n(k))$ be the weight vector at step $k$.
Let $\varepsilon(k)$ be a variable learning rate.

Local learning
If at step $k$: $w_0(k)+w_1(k)x_{i1}+w_2(k)x_{i2}+\cdots+w_n(k)x_{in}\le 0$ and it is given that $y=f(x_i)=1$ for some input vector $x_i=(x_{i1},x_{i2},\ldots,x_{in})$, then change $\hat{w}(k)$ into:

$\hat{w}(k+1)=\hat{w}(k)+\varepsilon(k)\hat{x}_i$

If at step $k$: $w_0(k)+w_1(k)x_{j1}+w_2(k)x_{j2}+\cdots+w_n(k)x_{jn}>0$ and it is given that $y=f(x_j)=0$ for some input vector $x_j=(x_{j1},x_{j2},\ldots,x_{jn})$, then change $\hat{w}(k)$ into:

$\hat{w}(k+1)=\hat{w}(k)-\varepsilon(k)\hat{x}_j$

Global learning
If $\hat{c}(k)=\sum\hat{x}_i-\sum\hat{x}_j$ with $\hat{x}_i\in T^{+}(k)$ and $\hat{x}_j\in T^{-}(k)$, with $T^{+}(k)$ the set of extended input vectors with target value 1 and actual output of the neuron at step $k$ equal to 0, and $T^{-}(k)$ the set of extended input vectors with target value 0 and actual output of the neuron at step $k$ equal to 1, then change the weight vector into:

$\hat{w}(k+1)=\hat{w}(k)+\varepsilon(k)\hat{c}(k)$

The Perceptron convergence theorem states: in the case of local or global learning the weight vector $\hat{w}(k)$ converges to a solution vector $\hat{s}$ if the samples are linearly separable and if the following conditions on the learning rate $\varepsilon(k)$ are satisfied:

1. $\varepsilon(k)>0$
2. $\lim_{m\to\infty}\sum_{k=1}^{m}\varepsilon(k)=\infty$
3. $\lim_{m\to\infty}\dfrac{\sum_{k=1}^{m}(\varepsilon(k))^2}{\left(\sum_{k=1}^{m}\varepsilon(k)\right)^2}=0$

The conditions imply that convergence occurs for any positive constant learning rate $\varepsilon$, or if $\varepsilon(k)=1/k$, or even if it increases like $\varepsilon(k)=k$.

If the set of samples is not linearly separable, then the separating hyperplane defined by $\sum_{i=0}^{n}w_ix_i=0$ will oscillate between several positions if the learning rate is constant or increasing. The same occurs when the data set contains contradicting samples. So we can formulate the following practical statement:

Practical statement 2.1
A decreasing value of the learning rate $\varepsilon(k)$ is particularly advisable if the set of samples may not be linearly separable or contains contradictions, because in that case the effect of 'disruptive' samples will be reduced.

(Note: For a formal proof of the theorem it is required that during learning, for correct classification, the inner product $\hat{w}\cdot\hat{x}>\theta$ for $\hat{x}\in T^{+}$ and $\hat{w}\cdot\hat{x}\le 0$ for $\hat{x}\in T^{-}$, with $\theta$ a certain small positive constant.)

We will not give a general proof of the theorem and restrict it to the case where $\varepsilon(k)=1/|\hat{c}(k)|$ for global learning.
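The special case treated in the proof, global learning with $\varepsilon(k)=1/|\hat{c}(k)|$, can be sketched as follows. The code is a minimal illustration, run on the four real-valued samples of Example 2.11 starting from $(2,1)$; the function name, the `max_steps` guard and the returned step count are assumptions made for the sketch.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def normalized_global_learning(w, samples, max_steps=1000):
    """Global learning with learning rate 1/|c(k)|.

    samples: list of (extended_input, target) pairs.
    Returns (final_weights, number_of_adaptations).
    """
    w = list(w)
    for k in range(max_steps):
        c = [0.0] * len(w)
        any_wrong = False
        for x, target in samples:
            output = 1 if dot(w, x) > 0 else 0
            if output != target:
                any_wrong = True
                sign = 1.0 if target == 1 else -1.0  # T+ is added, T- subtracted
                c = [ci + sign * xi for ci, xi in zip(c, x)]
        if not any_wrong:
            return w, k
        norm = math.sqrt(dot(c, c))
        if norm == 0.0:
            raise ValueError("correction vector vanished")
        # Every adaptation has unit length because eps(k) = 1/|c(k)|.
        w = [wi + ci / norm for wi, ci in zip(w, c)]
    raise RuntimeError("no solution reached within max_steps")


SAMPLES = [((1.0, 1.0), 1), ((1.0, 1.5), 1), ((1.0, 0.25), 0), ((1.0, -0.5), 0)]
weights, steps = normalized_global_learning((2.0, 1.0), SAMPLES)
print(weights, steps)
```

Because every adaptation has length 1, the squared norm of the weight vector grows by at most 1 per step, which is the property the proof exploits.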
  25. 25. Proof
If the learning process converges, then there exists a solution vector $\hat{s}$ such that after some finite time $m$: $\hat{w}(m)=\hat{s}$. If $\hat{s}$ is a solution vector, then the unit vector $\hat{u}=\hat{s}/|\hat{s}|$ is also a solution vector. Thus $\hat{w}(m)=\beta\hat{u}$ for some $\beta>0$. If $\hat{w}(m)=\beta\hat{u}$, then $\hat{u}\cdot\hat{w}(m)/|\hat{w}(m)|=1$. At every step $k$ we have for the value of the cosine of the angle between $\hat{u}$ and $\hat{w}(k)$: $\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)|\le 1$. We will show that at each step $\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)|$ will increase with a positive amount and thus after a finite number of steps must become equal to 1.

Consider $\hat{u}\cdot\hat{w}(k+1)=\hat{u}\cdot(\hat{w}(k)+\varepsilon(k)\hat{c}(k))$. Because we take $\varepsilon(k)=1/|\hat{c}(k)|$ we obtain $\hat{u}\cdot\hat{w}(k+1)=\hat{u}\cdot\hat{w}(k)+\hat{u}\cdot\hat{c}(k)/|\hat{c}(k)|$. Because $\hat{c}(k)=\sum\hat{x}_i-\sum\hat{x}_j$ with $\hat{x}_i\in T^{+}(k)$ and $\hat{x}_j\in T^{-}(k)$ we have $\hat{u}\cdot\hat{c}(k)=\delta(k)>0$. Let $\delta'=\min_k(\delta(k)/|\hat{c}(k)|)$. So at each step the inner product $\hat{u}\cdot\hat{w}(k)$ is increased with at least the value $\delta'$, thus $\hat{u}\cdot\hat{w}(k)>k\delta'$. (Note that if we take $\varepsilon(k)>1/|\hat{c}(k)|$, then the increments at each step will be larger.)

Now consider $|\hat{w}(k+1)|^2=|\hat{w}(k)+\varepsilon(k)\hat{c}(k)|^2=|\hat{w}(k)|^2+2\hat{w}(k)\cdot\hat{c}(k)/|\hat{c}(k)|+1$. Because $\hat{w}(k)\cdot\hat{c}(k)<0$ we have $|\hat{w}(k+1)|^2<|\hat{w}(k)|^2+1$. This implies $|\hat{w}(k)|^2<|\hat{w}(0)|^2+k$ and thus $|\hat{w}(k)|<(|\hat{w}(0)|^2+k)^{1/2}$.

The expression $\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)|$ can thus be bounded from below by:

$\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)|>\delta'k/(|\hat{w}(0)|^2+k)^{1/2}$

and thus after a finite number of steps $k$ the value of $\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)|$ must become equal to 1 and thus a solution vector must be reached. QED

We have found that after a finite number of steps the reinforcement learning rule will provide a correct solution if certain conditions are satisfied. There are, however, other learning rules that will give correct solutions under certain conditions. These rules are also based on the gradient descent procedures for minimizing certain criterion functions (for misclassification),
sometimes called error functions or energy functions.

The criterion function $E(\hat{w})$ is a positive scalar function which is zero if $\hat{w}$ is equal to a solution vector. With the gradient descent procedure we move through a sequence of weight vectors $\hat{w}_0,\hat{w}_1,\hat{w}_2,\ldots$ such that $E(\hat{w}_0)>E(\hat{w}_1)>E(\hat{w}_2)>\cdots$ and finally end up with $E(\hat{w}_n)=0$. The procedure starts with some arbitrary weight vector $\hat{w}_0$, then computes the gradient vector $\nabla E(\hat{w}_0)$. The next weight vector is obtained by moving in the direction of the steepest descent, i.e. along the negative of the gradient vector $\nabla E(\hat{w}_0)$.

An obvious choice for $E(\hat{w})$ would be the number of misclassified input vectors for the weight vector $\hat{w}$. But in that case $E(\hat{w})$ would be piecewise constant and the gradient of $E(\hat{w})$ is zero or undefined, so this would be a poor candidate for the criterion function.

The reinforcement learning rule used before is obtained if we minimize the Perceptron criterion function, i.e.

$E(\hat{w})=\sum_{\hat{x}\in T_w}-\hat{w}\cdot\hat{x}$ with $T_w=T_w^{+}\cup\{-T_w^{-}\}$

Because $\hat{w}\cdot\hat{x}$ is negative for each $\hat{x}\in T_w$ the criterion function $E(\hat{w})$ is always positive if there are misclassified input vectors, and it will only be zero for a solution vector $\hat{s}$. Because $\Delta E(\hat{w})=\nabla\{E(\hat{w})\}\cdot\Delta\hat{w}$ we have the situation that $E(\hat{w})$ will reduce if $\Delta\hat{w}$ is proportional to the negative value of the gradient vector $\nabla\{E(\hat{w})\}$. The $k$th component of the gradient vector, $\partial E(\hat{w})/\partial w_k$, is equal to $-\sum_{\hat{x}\in T_w}x_k$, thus we obtain for $\Delta\hat{w}$:

$\Delta\hat{w}=\varepsilon\cdot\sum_{\hat{x}\in T_w}\hat{x}$

as we used in the reinforcement learning rule. Another criterion function could be, for example:

$E'(\hat{w})=\sum_{\hat{x}\in T_w}(\hat{w}\cdot\hat{x})^2$ with $T_w=T_w^{+}\cup\{-T_w^{-}\}$

Although this criterion function can be used,
it turns out that the corresponding learning rule is inferior to the reinforcement learning rule.

2.6 Performance of a two-layer binary Perceptron

In the previous sections we found that a single-neuron binary Perceptron can realize linear threshold functions with respect to the binary variables $x_1,x_2,\ldots,x_n$. However, most logical functions $f:\{0,1\}^n\to\{0,1\}$ are not linear threshold functions with respect to $x_1,x_2,\ldots,x_n$. In this section, we will show that any logical function can be realized with a two-layer binary Perceptron with one neuron in the second layer.

First we define some concepts, using a terminology related to pattern recognition. The argument values of logical functions will be called patterns, so we have a pattern set $P=\{0,1\}^n$; a pattern $p_i$ is an element of $P$, i.e. $p_i=[p_{i1},p_{i2},\ldots,p_{in}]$ with $p_{ij}\in\{0,1\}$.

The intersection $r_k$ of a pattern $p_i\in\{0,1\}^n$ and a pattern $q_j\in\{0,1\}^n$ is defined by $p_i\cap q_j=r_k$, with $r_{kl}=1$ if $p_{il}=1$ and $q_{jl}=1$, and with $r_{kl}=0$ otherwise.

The union $r_k$ of a pattern $p_i\in\{0,1\}^n$ and a pattern $q_j\in\{0,1\}^n$ is defined by $p_i\cup q_j=r_k$, with $r_{kl}=1$ if $p_{il}=1$ or $q_{jl}=1$ or both, and with $r_{kl}=0$ otherwise.

The order of a pattern $p_i$ is defined as the number of 1s occurring in $p_i$, and is denoted by $|p_i|$.

In the previous section we treated $x_j$ as a binary variable; however, it is helpful to consider $x_j$ as a function $x_j:P\to\{0,1\}$ defined by $x_j(p_i)=p_{ij}$. Because $x_j(p_i)$ depends on one component, or one pixel, from the pattern $p_i$ we will call $x_j$ a pixel function.

We introduce a special logical function which acts as a mask by which we observe a pattern. Let $q\in P$, then $x_q:P\to\{0,1\}$ is a so-called mask function defined by:

$x_q(p_i)=1$ if $q\cap p_i=q$ and $x_q(p_i)=0$ otherwise
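The definitions above translate almost literally into code. The sketch below represents a pattern as a tuple of bits; the helper `pat` and all function names are illustrative choices, and the patterns used in the printed examples are arbitrary.

```python
def pat(bits):
    """Turn a string such as '1010' into a pattern (tuple of 0/1 values)."""
    return tuple(int(b) for b in bits)


def intersection(p, q):
    """Component-wise AND of two patterns."""
    return tuple(a & b for a, b in zip(p, q))


def union(p, q):
    """Component-wise OR of two patterns."""
    return tuple(a | b for a, b in zip(p, q))


def order(p):
    """Number of 1s occurring in the pattern."""
    return sum(p)


def pixel(j, p):
    """Pixel function x_j(p); j is 1-based as in the text."""
    return p[j - 1]


def mask(q, p):
    """Mask function x_q(p): 1 if q intersected with p equals q, else 0."""
    return 1 if intersection(q, p) == q else 0


print(intersection(pat("1100"), pat("1010")))  # (1, 0, 0, 0)
print(union(pat("1100"), pat("1010")))         # (1, 1, 1, 0)
print(order(pat("1011")))                      # 3
print(mask(pat("1100"), pat("1110")))          # 1
print(mask(pat("1100"), pat("1001")))          # 0
```

The mask function returns 1 exactly when every 1 of the mask $q$ is also a 1 of the observed pattern.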
  26. 26. Example 2.12
$x_{1100}(1110)=1$ and $x_{1100}(1001)=0$

A mask function can be written as a product of pixel functions. Let $q\in P$, then

$x_q(p_i)=z_{q_1}(p_i)\cdot z_{q_2}(p_i)\cdots z_{q_n}(p_i)$

with $z_{q_j}(p_i)=x_j(p_i)$ if $q_j=1$ and $z_{q_j}(p_i)=1$ if $q_j=0$.

We will call $q$ in $x_q(p_i)$ the mask of the mask function $x_q$.

Example 2.13
For $n=3$: $x_{101}=x_1x_3$

We define the substratum $S_{p_i}$ of a pattern $p_i$ as the set of patterns: $S_{p_i}=\{q_k\mid x_{q_k}(p_i)=1\}$.

Example 2.14
$S_{0011}=\{0000, 0010, 0001, 0011\}$

We define the cover $C_{p_i}$ of a pattern $p_i$ as the set of patterns: $C_{p_i}=\{q_k\mid x_{p_i}(q_k)=1\}$.

Example 2.15
$C_{0011}=\{0011, 0111, 1011, 1111\}$

If a logical function $y:\{0,1\}^n\to\{0,1\}$ is written as an arithmetical linear combination of the mask functions:

$y(p_i)=\sum_{q_j}w_{q_j}x_{q_j}(p_i)$ with $w_{q_j}\in\mathbb{R}$

then we will call such a form an arithmetical conjunctive normal form. In the case where the logical function is written as:

$y(p_i)=S\left(\sum_{q_j}w_{q_j}x_{q_j}(p_i)\right)$ with $w_{q_j}\in\mathbb{R}$ and $S$ the step function

then we call this form an indirect arithmetical conjunctive normal form.

Now we can state the following theorem:

Theorem 2.5
Any logical function $y:\{0,1\}^n\to\{0,1\}$ can be written in an arithmetical conjunctive normal form:

$y=\sum_{q_j\in\{0,1\}^n}w_{q_j}x_{q_j}$

with $w_{q_j}$ an integer such that for each pattern $p_i\in P$:

$\sum_{q_j\in S_{p_i}}w_{q_j}=y(p_i)$ with $S_{p_i}$ the substratum of $p_i$

Before proving the theorem we will give an example of the theorem.

Example 2.16
The exclusive-or function defined by Table 2.11 can be written as:

$y=w_{00}x_{00}+w_{10}x_{10}+w_{01}x_{01}+w_{11}x_{11}$

It turns out that $w_{00}=0$, $w_{10}=1$, $w_{01}=1$ and $w_{11}=-2$. Thus:

$y=x_1+x_2-2x_1x_2$

Table 2.11

x1  x2  y
0   0   0
1   0   1
0   1   1
1   1   0

Proof of Theorem 2.5
We have to prove that for any logical function $y:\{0,1\}^n\to\{0,1\}$ we can find a unique set of coefficients $w_{q_j}$ such that:

$\sum_{q_j\in P}w_{q_j}x_{q_j}(p_i)=y(p_i)$ for each $p_i\in P=\{0,1\}^n$

Because $x_{q_j}(p_i)=1$ if $q_j\in S_{p_i}$ and 0 otherwise, we have for every $p_i$:

$\sum_{q_j\in P}w_{q_j}x_{q_j}(p_i)=\sum_{q_j\in S_{p_i}}w_{q_j}x_{q_j}(p_i)$

For $q_j\in S_{p_i}$ we have $x_{q_j}(p_i)=1$, thus:

$\sum_{q_j\in S_{p_i}}w_{q_j}x_{q_j}(p_i)=\sum_{q_j\in S_{p_i}}w_{q_j}$
  27. 27. Thus if:

$\sum_{q_j\in S_{p_i}}w_{q_j}=y(p_i)$ for each $p_i\in P$

we obtain:

$\sum_{q_j\in P}w_{q_j}x_{q_j}(p_i)=y(p_i)$

Moreover we have for a total function $2^n$ patterns and thus $2^n$ independent linear equations of the form:

$\sum_{q_j\in S_{p_i}}w_{q_j}=y(p_i)$

Because the number of coefficients $w_{q_j}$ is $2^n$, the solution is unique. QED

Example 2.17
Given the function $y:\{0,1\}^3\to\{0,1\}$ specified by Table 2.12, with the conditions for $\sum w_{q_j}$ given in the last column. The solution for the set of equations is:

$w_{000}=1$, $w_{100}=0$, $w_{001}=0$, $w_{101}=-1$, $w_{010}=-1$, $w_{110}=0$, $w_{011}=0$, $w_{111}=2$

Thus the function $y$ can be written as a linear combination of mask functions:

$y=1-x_2-x_1x_3+2x_1x_2x_3$

Table 2.12

      x1  x2  x3  y   condition
p0    0   0   0   1   = w000
p1    0   0   1   1   = w000 + w001
p2    0   1   0   0   = w000 + w010
p3    0   1   1   0   = w000 + w001 + w010 + w011
p4    1   0   0   1   = w000 + w100
p5    1   0   1   0   = w000 + w001 + w100 + w101
p6    1   1   0   0   = w000 + w010 + w100 + w110
p7    1   1   1   1   = w000 + w001 + w010 + w011 + w100 + w101 + w110 + w111

In Section 2.2 we defined a linear threshold function with respect to the binary variables $x_1,x_2,\ldots,x_n$ as a function that can be written as $y=S(\sum w_ix_i-T)$, with $S$ the step function. From Theorem 2.5 the next theorem follows immediately.

Theorem 2.6
Any logical function $y:\{0,1\}^n\to\{0,1\}$ is a linear threshold function with respect to the set of binary mask functions, i.e.

$y=\sum_{q_j}w_{q_j}x_{q_j}$ or $y=S\left(\sum_{q_j\ne 00\ldots0}w_{q_j}x_{q_j}-T\right)$ with threshold $T=-w_{00\ldots0}$

Example 2.18
The exclusive-or function of Example 2.16 is a linear threshold function with respect to the binary mask functions $x_{00}=1$, $x_{10}=x_1$, $x_{01}=x_2$ and $x_{11}=x_1x_2$:

$y=S(x_1+x_2-2x_1x_2-T)$ with $T=-w_{00}=0$

A central theorem is as follows:

Theorem 2.7
Any logical function $f:\{0,1\}^n\to\{0,1\}$ can be realized by a simple two-layer binary Perceptron.

Proof
Any logical function $y:\{0,1\}^n\to\{0,1\}$ is a linear threshold function with respect to the binary valued mask functions:

$y=S\left(\sum_{q_j\ne 00\ldots0}w_{q_j}x_{q_j}-T\right)$ with threshold $T=-w_{00\ldots0}$
Given the binary valued mask functions $x_q$, one single binary output neuron can realize the linear threshold function with respect to the mask functions $x_q$. The threshold of the second-layer neuron is equal to $-w_{00\ldots0}$ of the linear threshold function $y$. Any mask function $x_q=x_ix_j\cdots x_k$ is a linear threshold function with respect to $x_i,x_j,\ldots,x_k$ because $x_q=S(x_i+x_j+\cdots+x_k-T)$ with $T$ equal to the number of variables in the product $x_ix_j\cdots x_k$ minus 1, and thus it can be realized by a single
  28. 28. binary neuron in the first layer. Such a first-layer neuron has an input $x_i$ for every $x_i$ occurring in the product $x_q$, with a corresponding weight $w_i=1$. The output of such a first-layer neuron equals 1 if $x_q(p_i)=1$, and 0 otherwise. All neurons realizing mask functions constitute the first layer of the Perceptron. The output of a first-layer neuron realizing a mask function $x_q$ is multiplied by the synaptic weight $w_q$ of the connection to the output neuron. Thus the output of the second-layer neuron equals 1 if and only if:

$\sum_{q\ne 00\ldots0}w_qx_q>-w_{00\ldots0}$

QED

Figure 2.19 gives the two-layer binary Perceptron which realizes the logical function described in Example 2.17. Figure 2.20 gives an alternative realization of the same function.

Figure 2.19 Two-layer binary Perceptron illustrating Table 2.12

Figure 2.20 Alternative configuration of the binary Perceptron of Figure 2.19

In Theorem 2.5 we found that any logical function can be written in an arithmetical conjunctive normal form. The proof of the theorem also revealed a method of how to find that arithmetical conjunctive normal form. There is, however, another method to find the arithmetical conjunctive normal form. If we have written a logical function as a Boolean function then we can convert it in a systematic way into an equivalent arithmetical function using the following rules:

1. If $x_i$ is a Boolean function (a single variable), then $x_i$ is an equivalent arithmetical function.
2. If $\bar{x}_i$ is a Boolean function (the inverse of $x_i$), then $1-x_i$ is an equivalent arithmetical function.
3. If the Boolean function $f_1$ is equivalent to the arithmetical function $f'_1$ and the Boolean function $f_2$ is equivalent to the arithmetical function $f'_2$, then the Boolean function $f_1\wedge f_2$ is equivalent to the arithmetical function $f'_1f'_2$.
4. If the Boolean function $f_1$ is equivalent to the arithmetical function $f'_1$ and the Boolean function $f_2$ is equivalent to the arithmetical function $f'_2$, then the Boolean function $f_1\vee f_2$ (also written as $f_1+f_2$) is equivalent to the arithmetical function $S(f'_1+f'_2)$, with $S$ the step function.
5. If the Boolean function $f_1$ is equivalent to the arithmetical function $f'_1$ and the Boolean function $f_2$ is equivalent to the arithmetical function $f'_2$, and the Boolean functions $f_1$ and $f_2$ are not true for the same arguments, then the Boolean function $f_1\vee f_2$ is equivalent to the arithmetical function $f'_1+f'_2$.
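Rules 1, 2, 3 and 5 can be applied mechanically. In the sketch below an arithmetical function is stored as a dictionary that maps a set of variable indices (one mask) to an integer coefficient; this representation and all function names are assumptions made for illustration. Rule 4 is left out because the step function cannot be expressed in this representation. The input is the function of Table 2.12, which is true for the patterns 000, 001, 100 and 111.

```python
from collections import defaultdict


def _clean(poly):
    """Drop terms whose coefficient became zero."""
    return {m: c for m, c in poly.items() if c != 0}


def variable(i):
    """Rule 1: the Boolean variable x_i becomes the arithmetical x_i."""
    return {frozenset([i]): 1}


def inverse(i):
    """Rule 2: the inverse of x_i becomes 1 - x_i."""
    return {frozenset(): 1, frozenset([i]): -1}


def conj(f, g):
    """Rule 3: a Boolean AND becomes an arithmetical product."""
    out = defaultdict(int)
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            out[m1 | m2] += c1 * c2  # x_i * x_i = x_i for binary values
    return _clean(out)


def disjoint_or(f, g):
    """Rule 5: an OR of functions that are never true together becomes a sum."""
    out = defaultdict(int)
    for poly in (f, g):
        for m, c in poly.items():
            out[m] += c
    return _clean(out)


def minterm(bits):
    """Boolean product that is true only for the given pattern."""
    poly = {frozenset(): 1}
    for i, b in enumerate(bits, start=1):
        poly = conj(poly, variable(i) if b else inverse(i))
    return poly


y = {}
for pattern in [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 1, 1)]:
    y = disjoint_or(y, minterm(pattern))
print(y)
```

The resulting coefficients are 1 for the empty mask, -1 for $x_2$, -1 for $x_1x_3$ and 2 for $x_1x_2x_3$, in agreement with Example 2.17.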
  29. 29. Example 2.19
Consider the logical function specified by Table 2.13. This function can be specified by the Boolean function:

$y=\bar{x}_1\bar{x}_2\bar{x}_3+\bar{x}_1\bar{x}_2x_3+x_1\bar{x}_2\bar{x}_3+x_1x_2x_3$

Because at most one of the product terms of $y$ can be true for any argument, we can use rule (5) above and obtain the equivalent arithmetical conjunctive normal form:

$y=(1-x_1)(1-x_2)(1-x_3)+(1-x_1)(1-x_2)x_3+x_1(1-x_2)(1-x_3)+x_1x_2x_3$

or

$y=1-x_2-x_1x_3+2x_1x_2x_3$

as found before in Example 2.17.

A logical function can also be considered as a characteristic function of some set $Q\subseteq P$. A characteristic function of a set $Q\subseteq P$ is a logical function $f_Q$ such that $f_Q(p_i)=1$ if $p_i\in Q$ and $f_Q(p_i)=0$ otherwise.

As a consequence of the conversion rules stated above, we have the next theorem. In Theorem 2.8 we will use the complement of $p_i$, denoted by $\bar{p}_i$ (e.g. if $p_i=(010)$ then $\bar{p}_i=(101)$).

Theorem 2.8
A characteristic function $f_K:P\to\{0,1\}$ for the singleton $K=\{p_i\}$ can always be written in the arithmetical conjunctive normal form:

$f_K=\sum_{q_m\in C_{p_i}}(-1)^{|q_m\cap\bar{p}_i|}x_{q_m}$

with $C_{p_i}$ the cover of pattern $p_i$ and $|q_m\cap\bar{p}_i|$ the order of $q_m\cap\bar{p}_i$ (i.e. the number of ones occurring in $q_m\cap\bar{p}_i$).

Proof
Consider $x_i$ as a Boolean function, then any characteristic function of the singleton $\{p_i\}$ can be written as a Boolean product $f_{p_i}=\xi_1\xi_2\cdots\xi_n$ with $\xi_j=x_j$ if $p_{ij}=1$ and $\xi_j=\bar{x}_j$ (the negation of $x_j$) if $p_{ij}=0$. (For example, for $n=3$ and pattern $p_i=(101)$: $f_{p_i}=x_1\bar{x}_2x_3$.) The Boolean function $x_i$ is equivalent to the arithmetical pixel function $x_i$. The Boolean function $\bar{x}_i$ is equivalent to the arithmetical function $(1-x_i)$, thus if we replace the Boolean variables in the Boolean product by the corresponding arithmetical functions we obtain an equivalent arithmetical product. The elimination of the brackets in this product (and some thinking) gives the desired result.
QED

Example 2.20
Let $f_{p_i}=\bar{x}_1\bar{x}_2x_3x_4$ be the Boolean form of the characteristic function of the singleton $\{p_i\}=\{0011\}$. The complement of $p_i$ is $\bar{p}_i=1100$. The cover of $p_i$ is $C_{p_i}=\{0011, 1011, 0111, 1111\}$. Thus we can replace the Boolean function by the equivalent arithmetical function:

$f_{0011}=(-1)^{|0000|}x_{0011}+(-1)^{|1000|}x_{1011}+(-1)^{|0100|}x_{0111}+(-1)^{|1100|}x_{1111}$

or simply:

$f_{0011}=x_3x_4-x_1x_3x_4-x_2x_3x_4+x_1x_2x_3x_4$

(note: $\bar{x}_1\bar{x}_2x_3x_4=(1-x_1)(1-x_2)x_3x_4$)

Theorem 2.8 can be extended to sets of patterns as follows:

Theorem 2.9
A characteristic function $f_K:P\to\{0,1\}$ for a class of patterns $K\subseteq P$ can always be written in the arithmetical conjunctive normal form:

$f_K=\sum_{p_i\in K}f_{p_i}$

with:

$f_{p_i}=\sum_{q_m\in C_{p_i}}(-1)^{|q_m\cap\bar{p}_i|}x_{q_m}$

Proof
If $f_S$ is the characteristic function of a set of patterns $S$ (e.g. a singleton $\{p_i\}$) and $f_{p_j}$ the characteristic function of the singleton $\{p_j\}$, then one easily verifies that $f_S+f_{p_j}$ is the characteristic function of the set $S\cup\{p_j\}$. QED

Example 2.21
For $n=2$ let $K=\{00,10\}$. The characteristic function of $\{00\}$ is $1-x_1-x_2+x_1x_2$
  30. 30. and the characteristic function of $\{10\}$ is $x_1-x_1x_2$. The characteristic function of $K=\{00,10\}$ becomes $1-x_2$.

A mask function $x_{q_i}$ is the characteristic function of a set of patterns $K_{q_i}$ defined by:

$K_{q_i}=\{p_j\mid x_{q_i}(p_j)=1\}$

Because this definition of $K_{q_i}$ is identical with the definition of the cover $C_{q_i}$ of pattern $q_i$ we have the following lemma:

Lemma 2.3
The mask function $x_{q_i}$ is the characteristic function of the cover $C_{q_i}$ of pattern $q_i$.

This lemma gives us a convenient set-theoretical interpretation of the arithmetical conjunctive normal form of a logical function $y$, and a way to determine the arguments for which $y=1$. In a subsequent section we will see how it can also give us the means to show that a two-layer binary Perceptron is able to generalize from samples of desired behaviour.

Example 2.22
The arithmetical conjunctive normal form of the exclusive-or function is as follows:

$y=x_1+x_2-2x_1x_2$

Thus $y=1$ for the elements of the set specified by:

$C_{10}+C_{01}-2C_{11}=\{10,11\}+\{01,11\}-2\cdot\{11\}=\{10,01\}$

In this example we used the operations of addition, subtraction and multiplication of sets. These operators must be defined in a formal way. We will come back to it in Section 2.8.

Because of Theorem 2.8 a characteristic function $f_{p_i}$ of a pattern $p_i$ can be written in an arithmetical conjunctive normal form:

$f_{p_i}=\sum_{q_m\in C_{p_i}}(-1)^{|q_m\cap\bar{p}_i|}x_{q_m}$

and thus the class containing only $p_i$ can also be written as an arithmetical sum of the covers $C_{q_m}$ (the set of patterns covering pattern $q_m$) represented by the mask functions in the arithmetical conjunctive normal form:

$\{p_i\}=\sum_{q_m\in C_{p_i}}(-1)^{|q_m\cap\bar{p}_i|}C_{q_m}$
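The formula of Theorem 2.8 can be checked by brute force. The sketch below builds the signed coefficients for a singleton and evaluates the resulting arithmetical form on every pattern; the function names are illustrative, and patterns are written as tuples of bits.

```python
from itertools import product


def mask(q, p):
    """Mask function x_q(p): 1 if every 1 of q is also a 1 of p."""
    return 1 if all(pi >= qi for qi, pi in zip(q, p)) else 0


def cover(p):
    """All patterns q that cover p, i.e. with x_p(q) = 1."""
    return [q for q in product((0, 1), repeat=len(p)) if mask(p, q) == 1]


def singleton_coefficients(p):
    """Coefficient (-1)^|q AND complement(p)| for every q in the cover of p."""
    complement = tuple(1 - b for b in p)
    return {
        q: (-1) ** sum(qi & ci for qi, ci in zip(q, complement))
        for q in cover(p)
    }


def evaluate(coefficients, r):
    """Value of the arithmetical conjunctive normal form at pattern r."""
    return sum(c * mask(q, r) for q, c in coefficients.items())


p = (0, 0, 1, 1)
coefficients = singleton_coefficients(p)
print(coefficients)
print([r for r in product((0, 1), repeat=4) if evaluate(coefficients, r) == 1])
```

For the pattern 0011 the only argument on which the form evaluates to 1 is 0011 itself; on all other patterns the signed terms cancel.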
Example 2.23
For $n=4$ and pattern $p_i=0011$ the characteristic function is as follows:

$f_{p_i}=x_3x_4-x_1x_3x_4-x_2x_3x_4+x_1x_2x_3x_4$

Thus

$\{p_i\}=C_{0011}-C_{1011}-C_{0111}+C_{1111}$

Hence

$\{p_i\}=\{0011\}=\{1111,0111,1011,0011\}-\{1111,1011\}-\{1111,0111\}+\{1111\}$

2.7 The adaptive recruitment learning rule

In the previous section we found that any logical function $y:\{0,1\}^n\to\{0,1\}$ can always be written in the arithmetical conjunctive normal form, and that every arithmetical conjunctive normal form can be realized by a two-layer binary Perceptron. A two-layer neural network that realizes such a logical function by some linear threshold function with respect to the mask functions $x_q$,

$y(p_i)=S\left(\sum_q w_qx_q(p_i)-T\right)$

must have, for each $q\in\{0,1\}^n$ for which $w_q\ne 0$, an input neuron realizing $x_q$; moreover the weight of the connection from the output of a first-layer neuron realizing $x_q$ to the input of the output neuron must be equal to $w_q$.

Assuming we do not have an explicit description of the threshold function but only a finite set $D$, the training set, of examples consisting of pairs $(p_i,y(p_i))$, then we investigate whether we can develop a learning rule leading to the recruitment of the required first-layer neurons and correct values of the weights $w_q$ such that at the end of the learning process the two-layer binary Perceptron will at least give the correct response to all patterns of $D$.

A logical function $y:P\to\{0,1\}$ with $P=\{0,1\}^n$ can be considered as a pattern classification function, i.e. $y(p_i)=1$ if $p_i$ belongs to some subset $K$ of $P$ and $y(p_i)=0$ if $p_i$ belongs to the complement $\bar{K}=P-K$. The subset of patterns in the given data set $D$ that are elements of $K$ will be called the set of examples $E$, and the subset of patterns of $D$ that are elements of $\bar{K}$ will be called the set of counterexamples.
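The explicit two-layer construction recalled at the start of this section can be sketched for the case where the complete truth table is known. This is not the recruitment learning rule itself; it only shows the network that such a rule has to arrive at. The weights are obtained by solving the substratum equations of Theorem 2.5 in order of increasing pattern order, and the function names are illustrative. The truth table used is that of Table 2.12.

```python
from itertools import product


def step(z):
    """Step function S: 1 for z > 0, otherwise 0."""
    return 1 if z > 0 else 0


def in_substratum(r, q):
    """True if r belongs to the substratum of q (every 1 of r is a 1 of q)."""
    return all(ri <= qi for ri, qi in zip(r, q))


def solve_weights(y, n):
    """Solve sum_{r in S_q} w_r = y(q) for all q, lowest order first."""
    weights = {}
    for q in sorted(product((0, 1), repeat=n), key=sum):
        lower = sum(w for r, w in weights.items() if in_substratum(r, q))
        weights[q] = y[q] - lower
    return weights


def two_layer_output(weights, p):
    """Output of the two-layer binary Perceptron for input pattern p."""
    zero = tuple(0 for _ in p)
    total = 0
    for q, w in weights.items():
        if w == 0 or q == zero:
            continue  # only masks with a non-zero weight need a neuron
        # First-layer neuron: unit weights on the inputs selected by q,
        # threshold equal to the number of selected inputs minus 1.
        selected = sum(pi for pi, qi in zip(p, q) if qi)
        total += w * step(selected - (sum(q) - 1))
    # Output neuron with threshold T = -w_{00...0}.
    return step(total - (-weights[zero]))


TABLE = {
    (0, 0, 0): 1, (0, 0, 1): 1, (0, 1, 0): 0, (0, 1, 1): 0,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 0, (1, 1, 1): 1,
}
weights = solve_weights(TABLE, 3)
print({q: w for q, w in weights.items() if w != 0})
print(all(two_layer_output(weights, p) == t for p, t in TABLE.items()))
```

For this table only three first-layer neurons are needed, one for each of the masks 010, 101 and 111; the weight of the all-zero mask is absorbed into the threshold of the output neuron.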
The learning rule we will present will give the correct response to the training set D after a finite learning time and will generalize in some sense on the basis of the training set to other inputs not present in the training set. In contradistinction with learning rules for other artificial neural nets, we have to present every element of the training set D just once. This implies that the learning time is proportional to the