
Intelligent System

Presentation on intelligent systems by Suneel Kumar



  1. // shri krishnan // Introduction: Neuron Physiology, Artificial Neurons, Learning, Feed-forward and feedback networks, Features of ANN, Training algorithms: Perceptron learning rule, Delta rule, Back-propagation, RBFN, Recurrent networks, Chebyshev neural network, Connectionist model.
  2. • They are extremely powerful computational devices (Turing-equivalent, universal computers).
     • Massive parallelism makes them very efficient.
     • They can learn and generalize from training data, so there is no need for enormous feats of programming.
     • They are particularly fault tolerant; this is equivalent to the "graceful degradation" found in biological systems.
     • They are very noise tolerant, so they can cope with situations where normal symbolic systems would have difficulty.
     • In principle, they can do anything a symbolic/logic system can do, and more. (In practice, getting them to do it can be rather difficult.)
  3. What are Artificial Neural Networks used for? As with the field of AI in general, there are two basic goals for NN research:
     – Brain modelling: the scientific goal of building models of how real brains work. This can potentially help us understand the nature of human intelligence, formulate better teaching strategies, or better remedial actions for brain-damaged patients.
     – Artificial system building: the engineering goal of building efficient systems for real-world applications. This may make machines more powerful, relieve humans of tedious tasks, and may even improve upon human performance.
  4. • Brain modelling
       – Models of human development: help children with developmental problems
       – Simulations of adult performance: aid our understanding of how the brain works
       – Neuropsychological models: suggest remedial actions for brain-damaged patients
     • Real-world applications
       – Financial modelling: predicting stocks, shares, currency exchange rates
       – Other time-series prediction: climate, weather, marketing tactician
       – Computer games: intelligent agents, backgammon, first-person shooters
       – Control systems: autonomous adaptable robots, microwave controllers
       – Pattern recognition: speech and handwriting recognition, sonar signals
       – Data analysis: data compression, data mining
       – Noise reduction: function approximation, ECG noise reduction
       – Bioinformatics: protein secondary structure, DNA sequencing
  5. A Brief History
     • 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model.
     • 1949: Hebb published his book The Organization of Behavior, in which the Hebbian learning rule was proposed.
     • 1958: Rosenblatt introduced the simple single-layer networks now called perceptrons.
     • 1969: Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons, and almost the whole field went into hibernation.
     • 1982: Hopfield published a series of papers on Hopfield networks.
     • 1982: Kohonen developed the Self-Organizing Maps that now bear his name.
     • 1986: The back-propagation learning algorithm for multi-layer perceptrons was rediscovered and the whole field took off again.
     • 1990s: The sub-field of radial basis function networks was developed.
     • 2000s: The power of ensembles of neural networks and support vector machines became apparent.
  6. The Brain vs. the Computer
     • Brain: 10 billion neurons; 60 trillion synapses; distributed processing; nonlinear processing; parallel processing.
     • Computer: switching faster than a neuron (10^-9 s vs. 10^-3 s for a neuron); central processing; arithmetic operation (linearity); sequential processing.
  7. Computers and the Brain
     – Arithmetic: 1 brain = 1/10 pocket calculator
     – Vision: 1 brain = 1000 supercomputers
     – Memory of arbitrary details: computer wins
     – Memory of real-world facts: brain wins
     – A computer must be programmed explicitly; the brain can learn by experiencing the world
     – Computational power: one operation at a time, with 1 or 2 inputs; brain power: millions of operations at a time, with thousands of inputs
  8. Inherent advantages of the brain: "distributed processing and representation"
     – Parallel processing speeds
     – Fault tolerance
     – Graceful degradation
     – Ability to generalize
  9. • We are able to recognize many input signals that are somewhat different from any signal we have seen before, e.g. our ability to recognize a person in a picture we have not seen before, or to recognize a person after a long period of time.
     • We are able to tolerate damage to the neural system itself. Humans are born with as many as 100 billion neurons. Most of these are in the brain, and most are not replaced when they die. In spite of our continuous loss of neurons, we continue to learn.
  10. There are many applications that we would like to automate, but have not automated, due to the complexities associated with programming a computer to perform the tasks. To a large extent the problems are not unsolvable; rather, they are difficult to solve using sequential computer systems. If the only tool we have is a sequential computer, then we will naturally try to cast every problem in terms of sequential algorithms. Many problems are not suited to this approach,
      • causing us to expend a great deal of effort on the development of sophisticated algorithms,
      • perhaps even failing to find an acceptable solution.
  11. The problem of visual pattern recognition is an example of the difficulties we encounter when we try to make a sequential computer system perform an inherently parallel task. Since the dog is illustrated as a series of black spots on a white background, how can we write a computer program to determine accurately which spots form the outline of the dog, which spots can be attributed to the spots on his coat, and which spots are simply distractions?
  12. An even better question is this: how is it that we can see the dog in the image quickly, yet a computer cannot perform this discrimination? This question is especially poignant when we consider that the switching time of the components in modern electronic computers is several orders of magnitude faster than that of the cells that comprise our neurobiological systems.
  13. The question is partially answered by the fact that the architecture of the human brain is significantly different from the architecture of a conventional computer. The ability of the brain to perform complex pattern recognition in a few hundred milliseconds, even though the response time of the individual neural cells is typically on the order of a few tens of milliseconds, is because of
      • the massive parallelism
      • interconnectivity
  14. In many real-world applications, we want our computers to perform complex pattern recognition problems. Our conventional computers are obviously not suited to this type of problem. We borrow features from the physiology of the brain as the basis for our new processing models. Hence, ANNs.
  15. Biological Neuron. Cell structures:
      – Cell body
      – Dendrites
      – Axon
      – Synaptic terminals
  16. 1. The soma is a large, round central body in which almost all the logical functions of the neuron are realized (i.e. the processing unit).
      2. The axon (output) is a nerve fibre attached to the soma which can serve as a final output channel of the neuron. An axon is usually highly branched.
      3. The dendrites (inputs) are a highly branching tree of fibres: long, irregularly shaped nerve fibres attached to the soma that carry electrical signals to the cell.
      4. Synapses are the points of contact between the axon of one cell and the dendrite of another, regulating a chemical connection whose strength affects the input to the cell.
      [Figure: the schematic model of a biological neuron, showing the soma, dendrites, axon, and synapses with other neurons]
  17. Biological NN
      • The many dendrites receive signals from other neurons.
      • The signals are electric impulses that are transmitted across a synaptic gap by means of a chemical process.
      • The action of the chemical transmitter modifies the incoming signal (typically, by scaling the frequency of the signals that are received) in a manner similar to the action of the weights in an artificial neural network.
      • The soma, or cell body, sums the incoming signals. When sufficient input is received, the cell fires; that is, it transmits a signal over its axon to other cells.
      • It is often supposed that a cell either fires or doesn't at any instant of time, so that transmitted signals can be treated as binary.
  18. Several key features of the processing elements of an ANN are suggested by the properties of biological neurons:
      1. The processing element receives many signals.
      2. Signals may be modified by a weight at the receiving synapse.
      3. The processing element sums the weighted inputs.
      4. Under appropriate circumstances (sufficient input), the neuron transmits a single output.
      5. The output from a particular neuron may go to many other neurons (the axon branches).
  19. Several key features of the processing elements of an ANN are suggested by the properties of biological neurons (continued):
      6. Information processing is local.
      7. Memory is distributed:
         a) Long-term memory resides in the neurons' synapses or weights.
         b) Short-term memory corresponds to the signals sent by the neurons.
      8. A synapse's strength may be modified by experience.
      9. Neurotransmitters for synapses may be excitatory or inhibitory.
  20. ANNs vs. Computers
      Digital computers:
      • Analyze the problem to be solved.
      • Deductive reasoning: we apply known rules to input data to produce output.
      • Computation is centralized, synchronous, and serial.
      • Not fault tolerant: one transistor goes and it no longer works.
      • Static connectivity.
      • Applicable if there are well-defined rules with precise input data.
      Artificial neural networks:
      • No requirement for an explicit description of the problem.
      • Inductive reasoning: given input and output data (training examples), we construct the rules.
      • Computation is collective, asynchronous, and parallel.
      • Fault tolerant, with sharing of responsibilities.
      • Dynamic connectivity.
      • Applicable if rules are unknown or complicated, or if data are noisy or partial.
  21. A NN is characterized by its:
      1. Architecture: the pattern of connections between the neurons
      2. Training/learning algorithm: the method of determining the weights on the connections
      3. Activation function
  22. Neurons. A NN consists of a large number of simple processing elements called neurons.
      • Each input channel i can transmit a real value xi.
      • The primitive function f computed in the body of the abstract neuron can be selected arbitrarily.
      • Usually the input channels have an associated weight, which means that the incoming information xi is multiplied by the corresponding weight wi.
      • The transmitted information is integrated at the neuron (usually just by adding the different signals) and the primitive function is then evaluated.
  23. • Typically, neurons in the same layer behave in the same manner.
      • To be more specific, in many neural networks the neurons within a layer are either fully interconnected or not interconnected at all.
      • Neural nets are often classified as single-layer or multilayer.
      • The input units are not counted as a layer because they do not perform any computation.
      • So the number of layers in the NN is the number of layers of weighted interconnection links between slabs of neurons.
  24. Types of Neural Networks. Neural network types can be classified based on the following attributes:
      • Applications: classification, clustering, function approximation, prediction
      • Connection type: static (feedforward), dynamic (feedback)
      • Topology: single layer, multilayer, recurrent, self-organized
      • Learning methods: supervised, unsupervised
  25. Architecture Terms
      • Feedforward: all of the arrows connecting unit to unit in a network move only from input to output.
      • Recurrent or feedback networks: arrows feed back into prior layers.
      • Hidden layer: a middle layer of units; not the input layer and not the output layer.
      • Hidden units: nodes that are situated between the input nodes and the output nodes.
      • Perceptron: a network with a single layer of weights.
  26. Single-layer Net
      • A single-layer net has one layer of connection weights.
      • The units can be distinguished as input units, which receive signals from the outside world, and output units, from which the response of the net can be read.
      • Although the presented network is fully connected, a true biological neural network may not have all possible connections; a weight value of zero can be interpreted as "no connection".
  27. Multilayer Net
      • More complicated mapping problems may require a multilayer network.
      • A multilayer net is a net with one or more layers (or levels) of nodes (the so-called hidden units) between the input units and the output units.
      • Multilayer nets can solve more complicated problems than single-layer nets can, but training may be more difficult.
      • However, in some cases training may be more successful, because it is possible to solve a problem that a single-layer net cannot be trained to perform correctly at all.
  28. Recurrent Net. Local groups of neurons can be connected in either
      – a feedforward architecture, in which the network has no loops, or
      – a feedback (recurrent) architecture, in which loops occur in the network because of feedback connections.
  29. [Figure: feedforward and feedback network topologies]
  30. Learning Process
      • One of the most important aspects of a neural network is the learning process.
      • Learning can be done with supervised or unsupervised training.
      • In supervised training, both the inputs and the outputs are provided.
        o The network then processes the inputs and compares its resulting outputs against the desired outputs.
        o Errors are then calculated, causing the system to adjust the weights which control the network.
        o This process occurs over and over as the weights are continually tweaked.
      • In unsupervised training, the network is provided with inputs but not with desired outputs.
        o The system itself must then decide what features it will use to group the input data.
  31. Understanding Supervised and Unsupervised Learning. [Figure: scatter of data points labelled A and B]
  32. Two possible solutions... [Figure: two decision boundaries separating the A and B points]
      • It is based on a labelled training set.
      • The class of each piece of data in the training set is known.
      • Class labels are pre-determined and provided in the training phase.
  33. Unsupervised Learning
      • Input: a set of patterns P from an n-dimensional space S, but little/no information about their classification, evaluation, interesting features, etc. It must learn these by itself! :)
      • Tasks:
        – Clustering: group patterns based on similarity.
        – Vector quantization: fully divide up S into a small set of regions (defined by codebook vectors) that also helps cluster P.
        – Feature extraction: reduce the dimensionality of S by removing unimportant features (i.e. those that do not help in clustering P).
  34. Supervised vs. Unsupervised
      Supervised:
      • Tasks performed: classification, pattern recognition
      • NN models: perceptron, feed-forward NN
      • "What is the class of this data point?"
      Unsupervised:
      • Task performed: clustering
      • NN model: Self-Organizing Maps
      • "What groupings exist in this data?" "How is each data point related to the data set as a whole?"
  35. Activation Function. A neuron:
      • receives n inputs;
      • multiplies each input by its weight;
      • applies an activation function to the sum of the results;
      • outputs the result.
      (Figure: http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg)
      Usually we don't just use the weighted sum directly; we apply some function to the weighted sum before use (e.g., as output). We call this the activation function.
  36. The Neuron. [Figure: input signals x1, ..., xm with synaptic weights w1, ..., wm feed a summing function; the local field v, together with a bias b, passes through the activation function φ(·) to give the output y]
      A bias acts like a weight on a connection from a unit whose activation is always 1; increasing the bias increases the net input to the unit. Bias improves the performance of the NN.
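A minimal sketch of this model (the function and parameter names are illustrative, not from the slides): the bias is simply added to the weighted sum before the activation function is applied.

```python
def neuron(x, w, b, activation):
    """Single neuron: y = activation(sum_i w_i * x_i + b)."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b  # local field v plus bias
    return activation(v)

# e.g. a threshold unit with bias -0.5:
print(neuron([1, 0], [1, 1], -0.5, lambda v: 1 if v >= 0 else 0))  # prints 1
```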
  37. Binary step function:
      f(x) = 1 if x >= θ, 0 if x < θ, where θ is called the threshold.
      • Single-layer nets often use a step function to convert the net input, which is a continuously valued variable, to an output unit that is a binary (1 or 0) or bipolar (1 or -1) signal.
  38. Step Function Example. Let the threshold θ = 3, so f(x) = 1 if x >= 3, 0 if x < 3. [Figure: four input nodes with weights 0.3, -0.1, 2.1, -1.1 into one output unit]
      Input: (3, 1, 0, -2). Net input = 3(0.3) + 1(-0.1) + 0(2.1) + (-2)(-1.1) = 3. Network output after passing through the step activation function: f(3) = 1.
  39. Step Function Example (2). Same threshold θ = 3 and weights 0.3, -0.1, 2.1, -1.1.
      Input: (0, 10, 0, 0). Net input = 0(0.3) + 10(-0.1) + 0(2.1) + 0(-1.1) = -1. Network output after passing through the step activation function: f(-1) = 0.
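A short sketch reproducing both worked examples, with the weights and threshold as given above:

```python
def step(x, theta=3):
    """Binary step activation with threshold theta."""
    return 1 if x >= theta else 0

weights = [0.3, -0.1, 2.1, -1.1]

def net(x):
    """Weighted sum of the inputs."""
    return sum(w * xi for w, xi in zip(weights, x))

print(net([3, 1, 0, -2]), step(net([3, 1, 0, -2])))  # 3.0 -> f(3) = 1
print(net([0, 10, 0, 0]), step(net([0, 10, 0, 0])))  # -1.0 -> f(-1) = 0
```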
  40. Binary Sigmoid
      • Sigmoid functions (S-shaped curves) are useful activation functions.
      • The logistic function and the hyperbolic tangent function are the most common.
      • They are especially advantageous for use in neural nets trained by back-propagation, because the simple relationship between the value of the function at a point and the value of the derivative at that point reduces the computational burden during training.
  41. Sigmoid
      • Math used with some neural nets requires that the activation function be continuously differentiable.
      • A sigmoidal function is often used to approximate the step function:
        f(x) = 1 / (1 + e^(-σx)), where σ is the steepness parameter.
  42. [Figure: plots of the sigmoids 1/(1+exp(-x)) and 1/(1+exp(-10x)) over [-5, 5]; note sigmoid(0) = 0.5]
  43. Sigmoidal Example. Weights 0.3, -0.1, 2.1, -1.1 and f(x) = 1 / (1 + e^(-2x)).
      Input: (3, 1, 0, -2). Net input = 3, so the network output is f(3) = 1 / (1 + e^(-6)) ≈ 0.998.
  44. • A two-weight-layer, feed-forward network
      • Two inputs, one output, one hidden unit
      • f(x) = 1 / (1 + e^(-x))
      • Input: (3, 1); weights 0.5 and -0.5 into the hidden unit, and 0.75 from the hidden unit to the output.
      What is the output?
  45. Computing in Multilayer Networks
      • Start at the leftmost layer and compute activations based on the inputs.
      • Then work from left to right, using the computed activations as inputs to the next layer.
      • Example solution:
        – Activation of the hidden unit: f(0.5(3) + (-0.5)(1)) = f(1.5 - 0.5) = f(1) = 0.731
        – Output activation: f(0.731(0.75)) = f(0.548) = 0.634
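A short sketch reproducing this forward pass, using the logistic activation given on slide 44:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

x1, x2 = 3, 1                      # inputs
h = sigmoid(0.5 * x1 - 0.5 * x2)   # hidden unit: f(1.5 - 0.5) = f(1)
y = sigmoid(0.75 * h)              # output unit: f(0.548)
print(round(h, 3), round(y, 3))    # 0.731 0.634
```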
  46. Some activation functions of a neuron:
      • Step: Y = 1 if X >= 0; Y = 0 if X < 0
      • Sign: Y = +1 if X >= 0; Y = -1 if X < 0
      • Sigmoid: Y = 1 / (1 + e^(-X))
      • Linear: Y = X
  47. Function Composition in Feed-forward Networks
      • When the function is evaluated with a network of primitive functions, information flows through the directed edges of the network.
      • Some nodes compute values which are then transmitted as arguments for new computations.
      • If there are no cycles in the network, the result of the whole computation is well-defined and we do not have to deal with the task of synchronizing the computing units; we just assume that the computations take place without delay. [Figure: function composition]
  48. Function Composition in Recurrent Networks
      • If the network contains cycles, however, the computation is not uniquely defined by the interconnection pattern, and the temporal dimension must be considered.
      • When the output of a unit is fed back to the same unit, we are dealing with a recursive computation without an explicit halting condition.
      • If the arguments for a unit have been transmitted at time t, its output will be produced at time t + 1.
      • A recursive computation can be stopped after a certain number of steps and the last computed output taken as the result of the recursive computation.
  49. Feedforward vs. Recurrent NN
      Feedforward:
      • activation is fed forward from input to output through "hidden layers"
      • connections only "from left to right", no connection cycle
      • no memory
      Recurrent:
      • at least one connection cycle
      • activation can "reverberate", persisting even with no input
      • a system with memory
  50. Fan-in Property. The number of incoming edges into a node is not restricted by an upper bound. This is called the unlimited fan-in property of the computing units. [Figure: evaluation of a function of n arguments]
  51. Activation Functions at the Computing Units
      • Normally very simple activation functions of one argument are used at the nodes.
      • This means that the incoming n arguments have to be reduced to a single numerical value.
      • Therefore computing units are split into two functional parts:
        – an integration function g that reduces the n arguments to a single value, and
        – the output or activation function f that produces the output of this node, taking that single value as its argument.
      • Usually the integration function g is the addition function. [Figure: generic computing unit]
  52. McCULLOCH-PITTS (A Feed-forward Network)
      • It is one of the first neural network models, and very simple.
      – The nodes produce only binary results and the edges transmit exclusively ones or zeros.
      – A connection path is excitatory if the weight on the path is positive; otherwise it is inhibitory.
      – All excitatory connections into a particular neuron have the same weight. (However, it may receive multiple inputs from the same source, so the excitatory weights are effectively positive integers.)
  53. – Although all excitatory connections to a neuron have the same weight, the weights coming into one unit need not be the same as those coming into another unit.
      – Each neuron has a fixed threshold such that if the net input to the neuron is greater than the threshold, the neuron fires.
      – The threshold is set so that inhibition is absolute. That is, any nonzero inhibitory input will prevent the neuron from firing.
      – It takes one time step for a signal to pass over one connection link.
  54. Architecture
      • In general, a McCulloch-Pitts neuron Y can receive signals from any number of neurons.
      • Each connection is either excitatory, with weight w > 0, or inhibitory, with weight -p.
  55. "The threshold is set so that inhibition is absolute. That is, any nonzero inhibitory input will prevent the neuron from firing." What threshold value should we set? [Figure: for the unit shown, the threshold for unit Y is 4]
  56. • Suppose there are n excitatory input links with weight w and m inhibitory links with weight -p. What should the threshold value be?
      • The condition that inhibition is absolute requires that the activation function satisfy the inequality θ > nw - p.
      • If a neuron fires when it receives k or more excitatory inputs and no inhibitory inputs, what is the relation between k and θ? It is kw >= θ > (k - 1)w.
  57. Some Simple McCulloch-Pitts Neurons
      • The weights for a McCulloch-Pitts neuron are set, together with the threshold for the neuron's activation function, so that the neuron will perform a simple logic function.
      • Using these simple neurons as building blocks, we can model any function or phenomenon that can be represented as a logic function.
      • In the following examples we will take the threshold to be 2.
  58. AND [Figure: McCulloch-Pitts unit realizing AND]
  59. OR [Figure: McCulloch-Pitts unit realizing OR]
  60. Generalized AND and OR gates. [Figure]
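The figures are not reproduced here, but a McCulloch-Pitts unit is easy to state in code. A sketch with threshold 2, assuming the customary weight choices (weight 1 on each AND input, weight 2 on each OR input; these particular weights are an assumption, since the figures are missing from the transcript):

```python
def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts unit: fires (1) when the weighted sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = mp_neuron((x1, x2), (1, 1), theta=2)  # fires only when both inputs are 1
        or_out = mp_neuron((x1, x2), (2, 2), theta=2)   # fires when either input is 1
        print(x1, x2, and_out, or_out)
```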
  61. XOR. [Figure: inputs x1 and x2 with unknown weights "?" into a unit y]
      • How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error.
      • Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters. These can be used to compute the weights and thresholds.
  62. Finding the Weights Analytically
      • We have two weights w1 and w2 and the threshold θ, and for each training pattern we need to satisfy: output 1 when w1 x1 + w2 x2 >= θ, and output 0 when w1 x1 + w2 x2 < θ.
      So what inequalities do we get?
  63. • For the XOR network the four training patterns give:
        (0, 0) -> 0:  0 < θ
        (0, 1) -> 1:  w2 >= θ
        (1, 0) -> 1:  w1 >= θ
        (1, 1) -> 0:  w1 + w2 < θ
      – Clearly the second and third inequalities are incompatible with the fourth, so there is in fact no solution.
      – We need more complex networks, e.g. ones that combine together many simple networks, or use different activation / thresholding / transfer functions.
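A quick numerical illustration of this incompatibility (not from the slides): brute-force scanning a grid of weights and thresholds finds no single-unit solution for XOR.

```python
import itertools

patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
grid = [i / 10 for i in range(-20, 21)]  # candidate values in [-2, 2]
found = any(
    all((x1 * w1 + x2 * w2 >= theta) == bool(t) for (x1, x2), t in patterns)
    for w1, w2, theta in itertools.product(grid, repeat=3)
)
print(found)  # False: no (w1, w2, theta) satisfies all four inequalities
```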
  64. McCulloch-Pitts units can be used as binary decoders. Suppose F is a function with 3 arguments; design a McCulloch-Pitts unit that decodes the vector (1, 0, 1). [Figure: decoder for the vector (1, 0, 1)] Assume that a function F of three arguments has been defined according to a truth table; design McCulloch-Pitts units for it. To compute this function it is only necessary to decode all those vectors for which the function's value is 1.
  65. • The individual units in the first layer of the composite network are decoders.
      • For each vector for which F is 1, a decoder is used. In our case we need just two decoders.
      • Components of each vector which must be 0 are transmitted with inhibitory edges; components which must be 1, with excitatory ones.
      • The threshold of each unit is equal to the number of bits equal to 1 that must be present in the desired input vector.
      • The last unit to the right is a disjunction: if any one of the specified vectors can be decoded, this unit fires a 1.
  66. Absolute and Relative Inhibition. Two classes of inhibition can be identified:
      • Absolute inhibition corresponds to the kind used in McCulloch-Pitts units.
      • Relative inhibition corresponds to the case of edges weighted with a negative factor, whose effect is to lower the firing threshold when a 1 is transmitted through this edge.
  67. 1. Explain the logic functions (using truth tables) performed by the following networks with MP neurons. The neurons fire when the input is greater than the threshold. [Figure: three networks]
  68. [Figures: solution networks (a), (b), (c)]
  69. 2. Design networks using M-P neurons to realize the following logic functions using ±1 for the weights.
      a) s(a1, a2, a3) = a1 a2 a3
      b) s(a1, a2, a3) = ~a1 a2 ~a3
      c) s(a1, a2, a3) = a1 a3 + a2 a3 + ~a1 ~a3
  70. [Figures: solution networks (a), (b), (c)]
  71. Detecting Hot and Cold
      • If we touch something hot, we will perceive heat.
      • If we touch something cold briefly, we perceive heat.
      • If we keep touching something cold, we will perceive cold.
      To model this we will assume that time is discrete:
      • If cold is applied for one time step, then heat will be perceived.
      • If a cold stimulus is applied for two time steps, then cold will be perceived.
      • If heat is applied, then we should perceive heat.
  72. [Figure: inputs x1 (heat) and x2 (cold) feed outputs Y1 (heat perceived) and Y2 (cold perceived)]
      • The desired response of the system is that cold is perceived if a cold stimulus is applied for two time steps, i.e.
        y2(t) = x2(t-2) AND x2(t-1)
  73. • Heat is perceived if either a hot stimulus is applied, or a cold stimulus is applied briefly (for one time step) and then removed:
        y1(t) = { x1(t-1) } OR { x2(t-3) AND NOT x2(t-2) }
  74. Hot & Cold. [Figure: the complete network]
  75. Cold stimulus (one step). [Figure]
  76. t = 1 [Figure]
  77. t = 2 [Figure]
  78. t = 3 [Figure]
  79. Cold stimulus (two steps). [Figure]
  80. t = 1 [Figure]
  81. t = 2 [Figure]
  82. Hot stimulus (one step). [Figure]
  83. t = 1 [Figure]
  84. [Figure]
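A sketch that simulates the perceived outputs directly from the two logic equations on slides 72-73 (the MP weights and thresholds of the figures are omitted; only the logic is coded, and the input sequences are illustrative):

```python
def perceive(x1_seq, x2_seq):
    """Return (heat, cold) output sequences for heat/cold input sequences."""
    T = len(x1_seq)
    x1 = lambda t: x1_seq[t] if 0 <= t < T else 0
    x2 = lambda t: x2_seq[t] if 0 <= t < T else 0
    # y1(t) = x1(t-1) OR (x2(t-3) AND NOT x2(t-2)); y2(t) = x2(t-2) AND x2(t-1)
    heat = [int(x1(t - 1) or (x2(t - 3) and not x2(t - 2))) for t in range(T + 3)]
    cold = [int(x2(t - 2) and x2(t - 1)) for t in range(T + 3)]
    return heat, cold

# Cold applied for one time step: heat is perceived (at t = 3).
print(perceive([0, 0, 0, 0], [1, 0, 0, 0]))
# Cold applied for two time steps: cold is perceived (at t = 3); the heat
# equation also fires once, at the trailing edge of the stimulus.
print(perceive([0, 0, 0, 0], [0, 1, 1, 0]))
```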
  85. Recurrent Networks
      • Neural networks were designed by analogy with the brain.
      • The brain's memory, however, works by association.
        o For example, we can recognize a familiar face even in an unfamiliar environment within 100-200 ms.
        o We can also recall a complete sensory experience, including sounds and scenes, when we hear only a few bars of music.
      • The brain routinely associates one thing with another. To emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network.
  86. A recurrent neural network has feedback loops from its outputs to its inputs. The presence of such loops has a profound impact on the learning capability of the network.
      • McCulloch-Pitts units can be used in recurrent networks by introducing a temporal factor in the computation.
      • It is assumed that computing the activation of each unit consumes a time unit: if the input arrives at time t, the result is produced at time t + 1.
      • Care needs to be taken to coordinate the arrival of the input values at the nodes.
        o This could make the introduction of additional computing elements necessary, whose sole mission is to insert the necessary delays for the coordinated arrival of information.
      • This is the same problem that any computer with clocked elements has to deal with.
  87. Design a network that processes a sequence of bits, giving off one bit of output for every bit of input, but in such a way that any two consecutive ones are transformed into the sequence 10. E.g. the binary sequence 00110110 is transformed into the sequence 00100100.
  88. 1. Design a McCulloch-Pitts unit capable of recognizing the letter "T" digitized in a 10 × 10 array of pixels. Dark pixels should be coded as ones, white pixels as zeroes.
      2. Build a recurrent network capable of adding two sequential streams of bits of arbitrary finite length.
      3. The parity of n given bits is 1 if an odd number of them is equal to 1; otherwise it is 0. Build a network of McCulloch-Pitts units capable of computing the parity function of two, three, and four given bits.
  89. Learning Algorithms for NN
      • A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior.
      • This is done by presenting some examples of the desired input-output mapping to the network.
        o A correction step is executed iteratively until the network learns to produce the desired response.
      • The learning algorithm is a closed loop of presentation of examples and of corrections to the network parameters.
  90. Learning Process in a Parametric System
      • In some simple cases the weights for the computing units can be found through a sequential test of stochastically generated numerical combinations.
      • However, such algorithms, which look blindly for a solution, do not qualify as "learning".
      • A learning algorithm must adapt the network parameters according to previous experience until a solution is found, if it exists.
  91. Classes of Learning Algorithms: 1. Supervised
      • Supervised learning denotes a method in which some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured.
      • The weights are corrected according to the magnitude of the error, in the way defined by the learning algorithm.
      • This kind of learning is also called learning with a teacher, since a control process knows the correct answer for the set of selected input vectors.
  92. Classes of Learning Algorithms: 2. Unsupervised
      • Unsupervised learning is used when, for a given input, the exact numerical output the network should produce is unknown.
      • In this case we do not know a priori which unit is going to specialize in which cluster. Generally we do not even know how many well-defined clusters are present. Since no "teacher" is available, the network must organize itself in order to be able to associate clusters with units.
  93. • If the model fits the training data too well (extreme case: the model duplicates the teacher data exactly), it has only "learnt the training data by heart" and will not generalize well.
      • This is particularly important with small training samples. Statistical learning theory addresses this problem.
      • For RNN training, however, this tended to be a non-issue, because known training methods have a hard time fitting the training data well in the first place.
  94. Types of Supervised Learning Algorithms
      1. Reinforcement learning: used when, after each presentation of an input-output example, we only know whether the network produces the desired result or not. The weights are updated based on this information (that is, the Boolean values true or false), so that only the input vector can be used for weight correction.
      2. Learning with error correction: the magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights; in many cases we try to eliminate the error in a single correction step.
  95. [Figure: taxonomy of the classes of learning algorithms]
  96. The perceptron: the simplest form of NN, used for classification of linearly separable patterns. Introduced by Rosenblatt (1962).
  97. Perceptrons can learn many Boolean functions (AND, OR, NAND, NOR) but not XOR.
      Are the AND and OR functions linearly separable? What about XOR? [Figures: scatter plots of class I (x, y = 1) and class II (o, y = -1) for AND, OR, and XOR]
  98. XOR. However, every Boolean function can be represented by a perceptron network that has two or more levels of depth.
  99. Perceptron Learning
      • How does a perceptron acquire its knowledge?
      • The question really is: how does a perceptron learn the appropriate weights?
  100. 1. Assign random values to the weight vector.
       2. Apply the weight update rule to every training example.
       3. Are all training examples correctly classified?
          a. Yes: quit.
          b. No: go back to step 2.
  101. There are two popular weight update rules:
       i) the perceptron rule, and
       ii) the delta rule.
  102. We start with an example. Consider the features:
       • Taste: Sweet = 1, Not_Sweet = 0
       • Seeds: Edible = 1, Not_Edible = 0
       • Skin: Edible = 1, Not_Edible = 0
       For the output: Good_Fruit = 1, Not_Good_Fruit = 0
  103. Let's start with no knowledge: the weights on all three inputs (taste, seeds, skin) are 0.0, and the output unit fires if Σ > 0.4.
  104. • To train the perceptron, we will show it examples and have it categorize each one.
       • Since it's starting with no knowledge, it is going to make mistakes. When it makes a mistake, we are going to adjust the weights to make that mistake less likely in the future.
       • When we adjust the weights, we're going to take relatively small steps to be sure we don't over-correct and create new problems.
       • It's going to learn the category "good fruit", defined as anything that is sweet and whose skin or seeds are edible.
         • Good fruit = 1
         • Not good fruit = 0
  105. Banana is good: taste = 1, seeds = 1, skin = 0; all weights 0.0; teacher = 1; fire if Σ > 0.4. What will the output be?
  106. • In this case we have (1 × 0) + (1 × 0) + (0 × 0), which adds up to 0.0.
       • Since that is less than the threshold (0.40), the response was "no", which is incorrect.
       • Since we got it wrong, we know we need to change the weights:
         ∆w = learning rate × (overall teacher - overall output) × node output
  107. The three parts of that are:
       – Learning rate: we set that ourselves. It should be large enough that learning happens in a reasonable amount of time, but small enough that it doesn't go too fast. Let's take it as 0.25.
       – (overall teacher - overall output): the teacher knows the correct answer (e.g., that a banana should be a good fruit). In this case, the teacher says 1, the output is 0, so (1 - 0) = 1.
       – Node output: that's what came out of the node whose weight we're adjusting. For the first node, 1.
  108. To put it together:
       – Learning rate: 0.25
       – (overall teacher - overall output): 1
       – Node output: 1
       ∆w = 0.25 × 1 × 1 = 0.25. Since it's a ∆w, it's telling us how much to change the first weight. In this case, we're adding 0.25 to it.
  109. Analysis of the Delta Rule: (overall teacher - overall output)
       – If we get the categorization right, (overall teacher - overall output) will be zero (the right answer minus itself).
       – In other words, if we get it right, we won't change any of the weights. As far as we know we have a good solution; why would we change it?
  110. – If we get the categorization wrong, (overall teacher - overall output) will be either -1 or +1.
       • If we said "yes" when the answer was "no", the weights are too high, and we will get a (teacher - output) of -1, which will result in reducing the weights.
       • If we said "no" when the answer was "yes", the weights are too low, and this will cause them to be increased.
  111. Node output:
       – If the node whose weight we're adjusting sent in a 0, then it didn't participate in making the decision. In that case, it shouldn't be adjusted; multiplying by zero makes that happen.
       – If the node whose weight we're adjusting sent in a 1, then it did participate, and we should change the weight (up or down as needed).
  112. How do we change the weights for banana?
       Feature | Learning rate | (teacher - output) | Node output | ∆w
       taste   | 0.25          | 1                  | 1           | +0.25
       seeds   | 0.25          | 1                  | 1           | +0.25
       skin    | 0.25          | 1                  | 0           | 0
       • To continue training, we show it the next example and adjust the weights.
       • We keep cycling through the examples until we go all the way through one time without making any changes to the weights. At that point, the concept is learned.
  113. Pear is good: taste = 1, seeds = 0, skin = 1; weights now 0.25, 0.25, 0.0; teacher = 1; fire if Σ > 0.4. What will the output be?
  114. How do we change the weights for pear?
       Feature | Learning rate | (teacher - output) | Node output | ∆w
       taste   | 0.25          | 1                  | 1           | +0.25
       seeds   | 0.25          | 1                  | 0           | 0
       skin    | 0.25          | 1                  | 1           | +0.25
  115. Lemon is not sweet: taste = 0, seeds = 0, skin = 0; weights now 0.50, 0.25, 0.25; teacher = 0; fire if Σ > 0.4.
       • Do we change the weights for lemon?
       • Since (overall teacher - overall output) = 0, there will be no change in the weights.
  116. Guava is good: taste = 1, seeds = 1, skin = 1; weights 0.50, 0.25, 0.25; teacher = 1; fire if Σ > 0.4.
       If you keep going, you will see that this perceptron can correctly classify the examples that we have.
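A sketch reproducing the whole fruit walkthrough: threshold 0.4, learning rate 0.25, and the ∆w rule from slide 106. The data rows are the four examples above.

```python
examples = [  # ([taste, seeds, skin], teacher)
    ([1, 1, 0], 1),  # banana
    ([1, 0, 1], 1),  # pear
    ([0, 0, 0], 0),  # lemon
    ([1, 1, 1], 1),  # guava
]
w = [0.0, 0.0, 0.0]
rate, threshold = 0.25, 0.4

changed = True
while changed:                      # cycle until a full pass makes no changes
    changed = False
    for x, teacher in examples:
        output = int(sum(wi * xi for wi, xi in zip(w, x)) > threshold)
        if output != teacher:       # delta_w = rate * (teacher - output) * input
            w = [wi + rate * (teacher - output) * xi for wi, xi in zip(w, x)]
            changed = True
print(w)  # [0.5, 0.25, 0.25], matching the weights reached on the slides
```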
  117. The perceptron rule put mathematically: for a new training example X = (x1, x2, ..., xn), update each weight according to
       Δwi = η (t - o) xi
       where t is the target output, o is the output generated by the perceptron, and η is a constant called the learning rate (e.g., 0.1).
  118. How Do Perceptrons Learn? [Figure: three inputs 1, 0, 1 with weights 0.5, 0.2, 0.8]
       What will the output be if the threshold is 1.2?
       • 1 × 0.5 + 0 × 0.2 + 1 × 0.8 = 1.3
       • Threshold = 1.2, and 1.3 > 1.2, so the output is 1.
       Assume the output was supposed to be 0. If α = 1 (α is the learning rate), what will the new weights be?
  119. • If the example is correctly classified, the term (t - o) equals zero, and no update of the weight is necessary.
       • If the perceptron outputs 0 and the real answer is 1, the weight is increased.
       • If the perceptron outputs 1 and the real answer is 0, the weight is decreased.
  120. Consider the following set of input training vectors and the initial weight vector:
       x1 = (1, -2, 0, -1)ᵀ, x2 = (0, 1.5, -0.5, -1)ᵀ, x3 = (-1, 1, 0.5, -1)ᵀ, w = (1, -1, 0, 0.5)ᵀ
       The learning constant is c = 0.1. The teacher's responses for x1, x2, x3 are d1 = -1, d2 = -1, d3 = 1. Train the perceptron using the perceptron learning rule.
  121. net1 = wᵀx1 = 1(1) + (-1)(-2) + 0(0) + 0.5(-1) = 2.5
       o1 = sgn(2.5) = 1, but d1 = -1, so a correction is needed: Δw = c (d - o) x
       w1 = w + 0.1(-1 - 1) x1 = (1, -1, 0, 0.5)ᵀ - 0.2 (1, -2, 0, -1)ᵀ = (0.8, -0.6, 0, 0.7)ᵀ
  122. net2 = w1ᵀx2 = 0.8(0) + (-0.6)(1.5) + 0(-0.5) + 0.7(-1) = -1.6
       Will a correction be required? No correction, since o2 = sgn(-1.6) = -1 = d2.
       net3 = w1ᵀx3 = 0.8(-1) + (-0.6)(1) + 0(0.5) + 0.7(-1) = -2.1
       Will a correction be required? Yes, since o3 = sgn(-2.1) = -1 while d3 = 1.
  123. w3 = w1 + 0.1(1 - (-1)) x3 = (0.8, -0.6, 0, 0.7)ᵀ + 0.2 (-1, 1, 0.5, -1)ᵀ = (0.6, -0.4, 0.1, 0.5)ᵀ
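A short script (illustrative, mirroring the worked example) that replays this perceptron-rule trace and prints each net input, output, and weight vector:

```python
def sgn(v):
    return 1 if v >= 0 else -1

xs = [[1, -2, 0, -1], [0, 1.5, -0.5, -1], [-1, 1, 0.5, -1]]
ds = [-1, -1, 1]
w = [1, -1, 0, 0.5]
c = 0.1

for x, d in zip(xs, ds):
    net = sum(wi * xi for wi, xi in zip(w, x))
    o = sgn(net)
    if o != d:  # perceptron rule: w <- w + c * (d - o) * x
        w = [wi + c * (d - o) * xi for wi, xi in zip(w, x)]
    print(net, o, w)
# 2.5  -> corrected to [0.8, -0.6, 0, 0.7]
# -1.6 -> no correction
# -2.1 -> corrected to [0.6, -0.4, 0.1, 0.5]
```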
  124. Strength: if the data are linearly separable and η is set to a sufficiently small value, the rule will converge to a hypothesis that classifies all training data correctly in a finite number of iterations.
       Weakness: if the data are not linearly separable, it will not converge.
  125. • Developed by Widrow and Hoff, the delta rule is also called the Least Mean Square (LMS) rule.
       • Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.
       • The delta rule is designed to overcome this difficulty.
       • The key idea of the delta rule: use gradient descent to search the space of possible weight vectors to find the weights that best fit the training examples.
  127. • Linear units are like perceptrons, but the output is used directly (not thresholded to 1 or -1).
       • A linear unit can be thought of as an unthresholded perceptron.
       • The output of a k-input linear unit is the weighted sum of its inputs (a real value, not binary).
       • It isn't reasonable to use a Boolean notion of error for linear units, so we need to use something else.
  128. Consider the task of training an unthresholded perceptron, that is a linear unit, for which the output o is given by
       o = w0 + w1x1 + ··· + wnxn
       We will use a sum-of-squares measure of error E, under hypothesis (weights) (w0, ..., wn) and training set D:
       E = ½ Σ_{d∈D} (td - od)²
       where td is training example d's target output value and od is the output of the linear unit under d's inputs.
  129. Hypothesis Space
       • To understand the gradient descent algorithm, it is helpful to visualize the entire space of possible weight vectors and their associated E values, as illustrated on the next slide.
         – Here the axes w0, w1 represent possible values for the two weights of a simple linear unit; the w0, w1 plane represents the entire hypothesis space.
         – The vertical axis indicates the error E relative to some fixed set of training examples. The error surface shown in the figure summarizes the desirability of every weight vector in the hypothesis space.
       • For linear units, this error surface must be parabolic with a single global minimum, and we desire the weight vector at this minimum.
  130. The error surface. [Figure: parabolic error surface over the (w0, w1) plane]
       How can we calculate the direction of steepest descent along the error surface? This direction can be found by computing the derivative of E with respect to each component of the vector w.
  132. This vector derivative is called the gradient of E with respect to the vector <w0, ..., wn>, written ∇E. ∇E is itself a vector, whose components are the partial derivatives of E with respect to each of the wi.
  133. • When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E.
       • The negative of this vector therefore gives the direction of steepest decrease.
       • Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is
         w ← w + Δw, where Δw = -η ∇E
       • Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search.
  134. By the chain rule we get ΔW ∝ 2(d - f) (∂f/∂s) X.
       • The problem: the derivative ∂f/∂s does not exist for a threshold unit.
       • Three solutions:
         – Ignore it: the error-correction procedure, ΔW ∝ 2(d - f) X
         – Fudge it: Widrow-Hoff
         – Approximate it: the generalized delta procedure
  135. How do we update W?
       • Incremental learning: adjust W so as to slightly reduce the error for one Xi (the weights change after the outcome of each sample).
       • Batch learning: adjust W so as to reduce the error for all Xi (a single weight adjustment).
  136. After all the mathematical jugglery, from ΔW ∝ 2(d - f)(∂f/∂s) X we get the following results:
       • Incremental learning, for the kth sample:
         Δwik = η (dk - fk) (∂f/∂s) xi
       • Batch learning, where the neuron weight is changed after all the patterns have been applied:
         Δwi = η Σ_{k=1}^{p} (dk - fk) (∂f/∂s) xi
  137. The gradient descent algorithm for training linear units is as follows: pick an initial random weight vector; apply the linear unit to all training examples, then compute Δwi for each weight; update each weight wi by adding Δwi, then repeat the process.
       • Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, given that a sufficiently small η is used.
       • If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification of the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.
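A minimal sketch of batch gradient descent for a linear unit, as described above; the training data here are invented purely for illustration:

```python
def train(data, n_inputs, eta=0.05, epochs=100):
    """Batch gradient descent for o = w . x, minimizing E = 1/2 * sum_d (t_d - o_d)^2."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        grad = [0.0] * n_inputs            # accumulates -dE/dw_i over the batch
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                grad[i] += (t - o) * xi    # since dE/dw_i = -sum_d (t_d - o_d) x_id
        w = [wi + eta * g for wi, g in zip(w, grad)]
    return w

# Learn t = 2*x1 - x2 from four samples (illustrative data):
data = [([1, 0], 2), ([0, 1], -1), ([1, 1], 1), ([2, 1], 3)]
print(train(data, 2))  # approaches [2, -1]
```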
  139. Summarizing all the key factors involved in gradient descent learning:
       • The purpose of neural network learning or training is to minimize the output errors on a particular set of training data by adjusting the network weights wij.
       • We define an appropriate error function E(wij) that "measures" how far the current network is from the desired one.
       • Partial derivatives of the error function ∂E(wij)/∂wij tell us which direction we need to move in weight space to reduce the error.
       • The learning rate η specifies the step sizes we take in weight space for each iteration of the weight update equation.
       • We keep stepping through weight space until the errors are "small enough".
       • If we choose neuron activation functions with derivatives that take on particularly simple forms, we can make the weight update computations very efficient.
       • These factors lead to powerful learning algorithms for training neural networks.
  140. Consider the following set of input training vectors and the initial weight vector:
       x1 = (1, -2, 0, -1)ᵀ, x2 = (0, 1.5, -0.5, -1)ᵀ, x3 = (-1, 1, 0.5, -1)ᵀ, w = (1, -1, 0, 0.5)ᵀ
       The learning constant is c = 0.1. The teacher's responses for x1, x2, x3 are d1 = -1, d2 = -1, d3 = 1. Train the perceptron using the delta rule. Take f(x) = 2/(1 + e^(-x)) - 1 and ∂f/∂s = ½(1 - o²).
  141. net1 = wᵀx1 = 2.5
       o1 = f(net1) = 2/(1 + e^(-2.5)) - 1 = 0.848
       ∂f/∂s = ½(1 - o1²) = 0.140
       Update rule: Δwik = η (dk - fk) (∂f/∂s) xi
  142. w1 = w + 0.1(-1 - 0.848)(0.140) x1 = (0.974, -0.948, 0, 0.526)ᵀ
       net2 = -1.948, o2 = -0.75, ∂f/∂s = 0.218, giving w2 = (0.974, -0.956, 0.002, 0.531)ᵀ
       net3 = -2.46, o3 = -0.842, ∂f/∂s = 0.145, giving w3 = (0.947, -0.929, 0.016, 0.505)ᵀ
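A short script replaying this delta-rule trace with the bipolar sigmoid f(x) = 2/(1 + e^(-x)) - 1 given on slide 140:

```python
import math

def f(x):
    """Bipolar sigmoid; its derivative can be written as (1 - o^2)/2."""
    return 2 / (1 + math.exp(-x)) - 1

xs = [[1, -2, 0, -1], [0, 1.5, -0.5, -1], [-1, 1, 0.5, -1]]
ds = [-1, -1, 1]
w = [1, -1, 0, 0.5]
c = 0.1

for x, d in zip(xs, ds):
    net = sum(wi * xi for wi, xi in zip(w, x))
    o = f(net)
    fprime = 0.5 * (1 - o * o)
    # delta rule: w <- w + c * (d - o) * f'(net) * x
    w = [wi + c * (d - o) * fprime * xi for wi, xi in zip(w, x)]
    print(round(net, 3), round(o, 3), [round(wi, 3) for wi in w])
# reproduces w1 = (0.974, -0.948, 0, 0.526), w2, and w3 = (0.947, -0.929, 0.016, 0.505)
```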
  143. Determine the weights of a network with 4 input and 2 output units using (a) the perceptron learning law and (b) the delta learning law with f(x) = 1/(1 + e^(-x)) for the following input-output pairs:
       Input: [1100] [1001] [0011] [0110]
       Output: [11] [10] [01] [00]
       Take ∂f/∂s = ½(1 - o²).
  144. • The perceptron learning rule and the LMS learning algorithm were designed to train a single-layer network.
       • These single-layer networks suffer from the disadvantage that they are only able to solve linearly separable classification problems.
       • The multilayer perceptron (MLP) is a hierarchical structure of several perceptrons, and overcomes the disadvantages of these single-layer networks.
  145. • No connections within a layer
       • No direct connections between input and output layers
       • Fully connected between layers
       • Often more than 3 layers
       • Number of output units need not equal number of input units
       • Number of hidden units per layer can be more or less than input or output units
       • Each unit is a perceptron
  146. [Figure: an example of a three-layered multilayer neural network with two layers of hidden neurons]
  147. Multilayered networks are capable of computing a wider range of Boolean functions than networks with a single layer of computing units.
  148. A special requirement: the training algorithm for multilayer networks requires differentiable, continuous nonlinear activation functions.
       • Such a function is the sigmoid, or logistic, function: a = σ(n) = 1/(1 + e^(-cn)), where n is the sum of products of the weights wi and the inputs xi, and c is a constant.
       • Another nonlinear function often used in practice is the hyperbolic tangent: a = tanh(n) = (e^n - e^(-n)) / (e^n + e^(-n)).
  149. • A feed-forward neural network is a computational graph whose nodes are computing units and whose directed edges transmit numerical information from node to node.
       • Each computing unit is capable of evaluating a single primitive function of its input.
       • In fact the network represents a chain of function compositions which transform an input into an output vector (called a pattern).
       • The learning problem consists of finding the optimal combination of weights so that the network function ϕ approximates a given function f as closely as possible.
       • However, we are not given the function f explicitly but only implicitly through some examples.
  150. • Consider a feed-forward network with n input and m output units. It can have any number of hidden units.
       • We are also given a training set {(x1, t1), ..., (xp, tp)} consisting of p ordered pairs of n- and m-dimensional vectors, which are called the input and output patterns.
       • Let the primitive functions at each node of the network be continuous and differentiable.
       • The weights of the edges are real numbers selected at random. When the input pattern xi from the training set is presented to this network, it produces an output oi which is in general different from the target ti.
  152. • We want to make oi and ti identical for i = 1, ..., p by using a learning algorithm.
       • More precisely, we want to minimize the error function of the network, defined as
         E = ½ Σ_{i=1}^{p} ||oi - ti||²
       • After minimizing this function for the training set, new unknown input patterns are presented to the network and we expect it to interpolate. The network must recognize whether a new input vector is similar to learned patterns and produce a similar output.
  153. • The back-propagation (BP) algorithm is used to find a local minimum of the error function.
       • The network is initialized with randomly chosen weights.
       • The gradient of the error function is computed and used to correct the initial weights.
       • E is a continuous and differentiable function of the weights w1, w2, ..., wl in the network.
       • We can thus minimize E by using an iterative process of gradient descent, for which we need to calculate the gradient ∇E = (∂E/∂w1, ..., ∂E/∂wl).
       • Each weight is updated using the increment Δwi = -η ∂E/∂wi.
  154. The MLP became applicable to practical tasks after the discovery of a supervised training algorithm for learning its weights: the back-propagation learning algorithm.
       • The back-propagation algorithm for training multilayer neural networks is a generalization of the LMS training procedure for nonlinear logistic outputs.
       • As with the LMS procedure, training is iterative, with the weights adjusted after the presentation of each example.
       • The back-propagation algorithm includes two passes through the network: a forward pass and a backward pass.
       [Figure: network inputs flow through the input, hidden, and output layers; the network outputs are compared with the desired output from the training set, and the error drives the back-propagation algorithm along a feedback path]
  155. Multilayer Network Structure. [Figure: inputs p1, p2, p3 feed two hidden layers of sigmoid units (weights wji and wkj) and an output layer (weights wlk), producing outputs a1 and a2; σ is the sigmoid function]
  156. The network is equivalent to a complex chain of function compositions. Nodes of the network are given a composite structure.
  157. Each node now consists of a left and a right side:
       • The right side computes the primitive function associated with the node.
       • The left side computes the derivative of this primitive function for the same input.
  158. The integration function can be separated from the activation function by splitting each node into two parts:
       • The first node computes the sum of the incoming inputs.
       • The second one computes the activation function s.
       • The derivative of s is s', and the partial derivative of the sum of n arguments with respect to any one of them is just 1.
       This separation simplifies the discussion, as we only have to think of a single function being computed at each node, and not of two.
  159. 1. The Feed-forward Step
       • A training input pattern is presented to the network input layer. The network propagates the input pattern from layer to layer until the output pattern is generated by the output layer.
       • Information comes from the left, and each unit evaluates its primitive function f on its right side as well as the derivative f' on its left side.
       • Both results are stored in the unit, but only the result from the right side is transmitted to the units connected to the right.
  160. In the feed-forward step, incoming information into a unit is used as the argument for the evaluation of the node's primitive function and its derivative. In this step the network computes the composition of the functions f and g. The correct result of the function composition is produced at the output unit, and each unit has stored some information on its left side.
  161. 2. The Back-propagation Step
       • If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network, from the output layer to the input layer.
       • The stored results are now used.
       • The weights are modified as the error is propagated.
162. 162. The backpropagation step provides an implementation of the chain rule. Any sequence of function compositions can be evaluated in this way, and its derivative can be obtained in the backpropagation step. We can think of the network as being used backwards with the input 1, whereby at each node the product with the value stored in the left side is computed.
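To make the two-sided node concrete, here is a minimal Python sketch (illustrative, not from the slides): each node stores the derivative of its primitive function on its "left side" during the forward pass, and the backward pass multiplies the traversing value by that stored derivative, which is exactly the chain-rule evaluation described above.

```python
import math

class Node:
    def __init__(self, f, f_prime):
        self.f = f              # primitive function (the "right side")
        self.f_prime = f_prime  # its derivative (the "left side")
        self.stored = 0.0       # derivative stored during the forward pass

    def forward(self, x):
        self.stored = self.f_prime(x)  # kept on the left side
        return self.f(x)               # transmitted to the units on the right

    def backward(self, upstream):
        return upstream * self.stored  # one chain-rule multiplication

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

sigmoid = Node(s, lambda x: s(x) * (1.0 - s(x)))
square = Node(lambda x: x * x, lambda x: 2.0 * x)

# Forward pass: compute square(sigmoid(x)); backward pass: feed 1 in from the right.
x = 0.5
y = square.forward(sigmoid.forward(x))
grad = sigmoid.backward(square.backward(1.0))  # d/dx of sigmoid(x)^2
print(y, grad)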
163. 163. Two kinds of signals pass through these networks: • Function signals: the input examples, propagated through the hidden units and processed by their transfer functions, emerge as outputs. • Error signals: the errors at the output nodes are propagated backward layer-by-layer through the network, so that each node returns its error back to the nodes in the previous hidden layer.
164. 164. Goal: minimize the sum of squared errors E = ½ Σi (yi − oi)², where Erri = yi − oi is the error at output unit i. Each output oi is a parameterized function of the inputs: the weights are the parameters of the function. The error is clear at the output layer, but how do we compute the errors for the hidden units? We can back-propagate the error from the output layer to the hidden layers. The back-propagation process emerges directly from a derivation of the overall error gradient.
165. 165. Backpropagation Learning Algorithm for MLP. [Figure: output node i with error Erri = yi − oi, connected through weights Wji to hidden node j, which in turn receives weights Wkj from node k.] The output layer weight update is similar to the perceptron update. Hidden node j is "responsible" for some fraction of the error δi in each of the output nodes to which it connects, depending on the strength of the connection between hidden node j and output node i.
166. 166. • Like perceptron learning, BP attempts to reduce the errors between the output of the network and the desired result. • However, assigning blame for errors to hidden nodes is not so straightforward. The error of the output nodes must be propagated back through the hidden nodes. • The contribution that a hidden node makes to an output node is related to the strength of the weight on the link between the two nodes and the level of activation of the hidden node when the output node was given the wrong level of activation. • This can be used to estimate the error value for a hidden node in the penultimate layer, and that can, in turn, be used in making error estimates for earlier layers.
167. 167. The basic algorithm can be summed up in the following equation (the delta rule) for the change to the weight wij from node i to node j: Δwij = η × δj × yi, i.e. weight change = learning rate × local gradient × input signal to node j.
168. 168. The local gradient δj is defined as follows: • If node j is an output node, δj is the product of f′(netj) and the error signal ej, where f(·) is the logistic function, netj is the total input to node j (i.e. Σi wij yi), and ej is the error signal for node j (i.e. the difference between the desired output and the actual output). • If node j is a hidden node, δj is the product of f′(netj) and the weighted sum of the δ's computed for the nodes in the next hidden or output layer that are connected to node j.
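As a minimal sketch of these two definitions (illustrative names and shapes, not the slides' code), with logistic units and one hidden layer:

```python
import numpy as np

def f(net):
    return 1.0 / (1.0 + np.exp(-net))  # logistic function

def local_gradients(net_h, net_o, W_out, desired, actual):
    # Output nodes: delta_j = f'(net_j) * e_j, where e_j = desired - actual
    delta_o = f(net_o) * (1 - f(net_o)) * (desired - actual)
    # Hidden nodes: delta_j = f'(net_j) * weighted sum of downstream deltas;
    # W_out[j, k] is assumed to be the weight from hidden node j to output node k
    delta_h = f(net_h) * (1 - f(net_h)) * (W_out @ delta_o)
    return delta_h, delta_o

def delta_rule(eta, delta, y):
    # Delta w_ij = eta * delta_j * y_i for every connection (i, j)
    return eta * np.outer(y, delta)
```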
169. 169. Stopping Criterion • Stop after a certain number of runs through all the training data (each run through all the training data is called an epoch); • Stop when the total sum-squared error reaches some low level. By total sum-squared error we mean Σp Σi ei², where p ranges over all of the training patterns and i ranges over all of the output units.
170. 170. Find the new weights when the following network is presented the input pattern [0.6 0.8 0]. The target output is 0.9. Use learning rate = 0.3 & binary sigmoid activation function.
171. 171. Step 1: Find the input at each of the hidden units. netz1 = 0 + 0.6 × 2 + 0.8 × 1 + 0 × 0 = 2. So, we get netz1 = 2, netz2 = 2.2, netz3 = 0.6 (since the bias at z3 is -1).
172. 172. Step 2: Find the output of each of the hidden units. So, we get oz1 = 0.8808, oz2 = 0.9002, oz3 = 0.646.
173. 173. Step 3: Find the input to the output unit Y. nety = -1 + 0.8808 × (-1) + 0.9002 × 1 + 0.646 × 2 = 0.3114. Step 4: Find the output of the output unit. oy = 0.5772.
174. 174. Step 5: Find the gradient at the output unit Y. δ1 = (t1 - oy) f′(nety). We know that for a binary sigmoid function f′(x) = f(x)(1 - f(x)). So f′(nety) = 0.5772 × (1 - 0.5772) = 0.244, and δ1 = (0.9 - 0.5772) × 0.244 = 0.0788.
175. 175. Step 6: Find the gradients at the hidden units. Remember: if node j is a hidden node, then δj is the product of f′(netj) and the weighted sum of the δ's computed for the nodes in the next hidden or output layer that are connected to node j. δz1 = δ1 w11 f′(netz1) = 0.0788 × (-1) × 0.8808 × (1 - 0.8808) = -0.0083; δz2 = 0.0071; δz3 = 0.0361.
176. 176. Step 7: Weight updation at the hidden units, using the delta rule Δwij = η × δj × yi (weight change = learning rate × local gradient × input signal to node j).
177. 177.
Δv11 = η δz1 x1 = 0.3 × (-0.0083) × 0.6 = -0.0015
Δv12 = η δz2 x1 = 0.3 × 0.0071 × 0.6 = 0.0013
Δv13 = η δz3 x1 = 0.3 × 0.0361 × 0.6 = 0.0065
Δv21 = η δz1 x2 = 0.3 × (-0.0083) × 0.8 = -0.002
Δv22 = η δz2 x2 = 0.3 × 0.0071 × 0.8 = 0.0017
Δv23 = η δz3 x2 = 0.3 × 0.0361 × 0.8 = 0.0087
Δv31 = η δz1 x3 = 0.3 × (-0.0083) × 0.0 = 0.0
Δv32 = η δz2 x3 = 0.3 × 0.0071 × 0.0 = 0.0
Δv33 = η δz3 x3 = 0.3 × 0.0361 × 0.0 = 0.0
Δw11 = η δ1 z1 = 0.3 × 0.0788 × 0.8808 = 0.0208
Δw21 = η δ1 z2 = 0.3 × 0.0788 × 0.9002 = 0.0212
Δw31 = η δ1 z3 = 0.3 × 0.0788 × 0.6460 = 0.0153
178. 178.
v11(new) = v11(old) + Δv11 = 2 - 0.0015 = 1.9985
v12(new) = 1.0013
v13(new) = 0.0065
v21(new) = 0.998
v22(new) = 2.0017
v23(new) = 2.0087
v31(new) = 0
v32(new) = 3
v33(new) = 1
w11(new) = w11(old) + Δw11 = -1 + 0.0208 = -0.9792
w21(new) = 1.0212
w31(new) = 2.0153
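The whole example can be checked with a short script. The initial weights below are the ones implied by the net inputs in Steps 1 and 3 (a hidden weight matrix v with biases [0, 0, -1], and output weights w = [-1, 1, 2] with bias -1, as read off the network figure that was lost in extraction):

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.6, 0.8, 0.0]); t = 0.9; eta = 0.3
v = np.array([[2.0, 1.0, 0.0],    # weights from x1 to z1, z2, z3
              [1.0, 2.0, 2.0],    # weights from x2
              [0.0, 3.0, 1.0]])   # weights from x3
bz = np.array([0.0, 0.0, -1.0])   # hidden biases
w = np.array([-1.0, 1.0, 2.0]); by = -1.0

net_z = bz + x @ v                 # Step 1: [2.0, 2.2, 0.6]
o_z = sig(net_z)                   # Step 2: [0.8808, 0.9002, 0.6460]
net_y = by + o_z @ w               # Step 3: 0.3114
o_y = sig(net_y)                   # Step 4: 0.5772

d1 = (t - o_y) * o_y * (1 - o_y)   # Step 5: 0.0788
d_z = d1 * w * o_z * (1 - o_z)     # Step 6: [-0.0083, 0.0071, 0.0361]

v_new = v + eta * np.outer(x, d_z) # Step 7: hidden weight updates
w_new = w + eta * d1 * o_z         # output weight updates
print(v_new)
print(w_new)                       # w_new[0] = -0.9792 (note the sign)
```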
179. 179. Three-layer network for solving the Exclusive-OR operation. [Figure: inputs x1 and x2 (neurons 1 and 2) feed hidden neurons 3 and 4 through weights w13, w23, w14, w24; the hidden neurons feed output neuron 5 through weights w35 and w45, producing output y5. Each of neurons 3, 4 and 5 also has a threshold input fixed at -1.]
180. 180. • The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to -1. • The initial weights and threshold levels are set randomly as follows: w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1 and θ5 = 0.3.
181. 181. • We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as:
y3 = sigmoid(x1 w13 + x2 w23 - θ3) = 1 / (1 + e^-(1×0.5 + 1×0.4 - 1×0.8)) = 0.5250
y4 = sigmoid(x1 w14 + x2 w24 - θ4) = 1 / (1 + e^-(1×0.9 + 1×1.0 + 1×0.1)) = 0.8808
• Now the actual output of neuron 5 in the output layer is determined as:
y5 = sigmoid(y3 w35 + y4 w45 - θ5) = 1 / (1 + e^-(-0.5250×1.2 + 0.8808×1.1 - 1×0.3)) = 0.5097
• Thus, the following error is obtained: e = yd,5 - y5 = 0 - 0.5097 = -0.5097
182. 182. • The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer. • First, we calculate the error gradient for neuron 5 in the output layer:
δ5 = y5 (1 - y5) e = 0.5097 × (1 - 0.5097) × (-0.5097) = -0.1274
• Then we determine the weight corrections, assuming that the learning rate parameter, α, is equal to 0.1:
Δw35 = α × y3 × δ5 = 0.1 × 0.5250 × (-0.1274) = -0.0067
Δw45 = α × y4 × δ5 = 0.1 × 0.8808 × (-0.1274) = -0.0112
Δθ5 = α × (-1) × δ5 = 0.1 × (-1) × (-0.1274) = 0.0127
183. 183. • Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:
δ3 = y3 (1 - y3) × δ5 × w35 = 0.5250 × (1 - 0.5250) × (-0.1274) × (-1.2) = 0.0381
δ4 = y4 (1 - y4) × δ5 × w45 = 0.8808 × (1 - 0.8808) × (-0.1274) × 1.1 = -0.0147
• We then determine the weight corrections:
Δw13 = α × x1 × δ3 = 0.1 × 1 × 0.0381 = 0.0038
Δw23 = α × x2 × δ3 = 0.1 × 1 × 0.0381 = 0.0038
Δθ3 = α × (-1) × δ3 = 0.1 × (-1) × 0.0381 = -0.0038
Δw14 = α × x1 × δ4 = 0.1 × 1 × (-0.0147) = -0.0015
Δw24 = α × x2 × δ4 = 0.1 × 1 × (-0.0147) = -0.0015
Δθ4 = α × (-1) × δ4 = 0.1 × (-1) × (-0.0147) = 0.0015
184. 184. • At last, we update all weights and thresholds:
w13 = 0.5 + 0.0038 = 0.5038
w14 = 0.9 - 0.0015 = 0.8985
w23 = 0.4 + 0.0038 = 0.4038
w24 = 1.0 - 0.0015 = 0.9985
w35 = -1.2 - 0.0067 = -1.2067
w45 = 1.1 - 0.0112 = 1.0888
θ3 = 0.8 - 0.0038 = 0.7962
θ4 = -0.1 + 0.0015 = -0.0985
θ5 = 0.3 + 0.0127 = 0.3127
• The training process is repeated until the sum of squared errors is less than 0.001.
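A few lines of Python reproduce this entire training step (a sketch; as in the slides, thresholds are implemented as weights on a constant input of -1):

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

x1 = x2 = 1.0; yd5 = 0.0; alpha = 0.1
w13, w14, w23, w24, w35, w45 = 0.5, 0.9, 0.4, 1.0, -1.2, 1.1
t3, t4, t5 = 0.8, -0.1, 0.3

y3 = sig(x1*w13 + x2*w23 - t3)   # 0.5250
y4 = sig(x1*w14 + x2*w24 - t4)   # 0.8808
y5 = sig(y3*w35 + y4*w45 - t5)   # 0.5097
e = yd5 - y5                      # -0.5097

d5 = y5*(1 - y5)*e                # -0.1274
d3 = y3*(1 - y3)*d5*w35           # 0.0381
d4 = y4*(1 - y4)*d5*w45           # -0.0147

# Apply the corrections; thresholds update with input -1
w35 += alpha*y3*d5; w45 += alpha*y4*d5; t5 += alpha*(-1)*d5
w13 += alpha*x1*d3; w23 += alpha*x2*d3; t3 += alpha*(-1)*d3
w14 += alpha*x1*d4; w24 += alpha*x2*d4; t4 += alpha*(-1)*d4
print(w13, w23, t3, w14, w24, t4, w35, w45, t5)
```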
185. 185. Q. Generate a NN using the BPN algorithm for the XOR logic function.
186. 186. Radial Basis Function Networks (RBFN) consist of 3 layers: • an input layer • a hidden layer • an output layer. The hidden units provide a set of functions that constitute an arbitrary basis for the input patterns. • Hidden units are known as radial centers and are represented by the vectors c1, c2, …, ch. • The transformation from input space to hidden-unit space is nonlinear, whereas the transformation from hidden-unit space to output space is linear. • The dimension of each center for a p-input network is p × 1.
187. 187. • Radial functions are a special class of function. • Their characteristic feature is that their response decreases or increases monotonically with distance from a central point. • The centre, the distance scale, and the precise shape of the radial function are parameters of the model. • In principle, they could be employed in any sort of model (linear or nonlinear) and any sort of network (single layer or multilayer).
188. 188. Radial Basis Function Network • There is one hidden layer of neurons with RBF activation functions describing local receptors. • There is one output node to combine linearly the outputs of the hidden neurons.
189. 189. • The radial basis functions in the hidden layer produce a significant non-zero response only when the input falls within a small localized region of the input space. • Each hidden unit has its own receptive field in input space. An input vector xi which lies in the receptive field for center cj would activate cj, and by proper choice of weights the target output is obtained. The output is given as y = Σj wj Φ(║x - cj║), where wj is the weight of the jth center and Φ is some radial function.
190. 190. Here, z = ║x - cj║. The most popular radial function is the Gaussian activation function, Φ(z) = exp(-z² / 2σ²), where σ is the width (spread) of the function.
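As an illustration, a minimal forward pass for such a network might look like this (names and values are illustrative):

```python
import numpy as np

def rbfn_output(x, centers, weights, sigma):
    # Hidden layer: nonlinear, distance-based activations
    z = np.linalg.norm(centers - x, axis=1)   # z = ||x - c_j|| for each center
    phi = np.exp(-(z**2) / (2 * sigma**2))    # Gaussian radial response
    # Output layer: a linear combination of the hidden activations
    return weights @ phi

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([1.0, -1.0])
print(rbfn_output(np.array([0.2, 0.1]), centers, weights, sigma=0.5))
```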
191. 191. RBFN vs. Multilayer Network
• RBF net: it has a single hidden layer. Multilayer net: it has multiple hidden layers.
• RBF net: the basic neuron model as well as the function of the hidden layer is different from that of the output layer. Multilayer net: the computational nodes of all the layers are similar.
• RBF net: the hidden layer is nonlinear but the output layer is linear. Multilayer net: all the layers are nonlinear.
• RBF net: the activation function of the hidden unit computes the Euclidean distance between the input vector and the center of that unit. Multilayer net: the activation function computes the inner product of the input vector and the weight of that unit.
192. 192. RBFN vs. Multilayer Network (contd.)
• RBF net: establishes local mapping, hence capable of fast learning. Multilayer net: constructs global approximations to the I/O mapping.
• RBF net: two-fold learning; both the centers (position and spread) and the weights have to be learned. Multilayer net: only the synaptic weights have to be learned.
• RBFs separate classes via hyperspheres; MLPs separate classes via hyperplanes. [Figure: in the X1-X2 plane, an RBF encloses one class with a circle, while an MLP splits the classes with a line.]
193. 193. • The training is performed by deciding on: – How many hidden nodes there should be – The centers and the sharpness of the Gaussians • Two stages: – In the 1st stage, the input data set is used to determine the parameters of the basis functions – In the 2nd stage, the basis functions are kept fixed while the second-layer weights are estimated (a simple BP algorithm, as for MLPs)
194. 194. • Training of an RBFN requires optimal selection of the parameter vectors ci and wi, i = 1, …, h. • Both layers are optimized using different techniques and on different time scales. • The following techniques are used to update the weights and centers of an RBFN: o Pseudo-Inverse Technique o Gradient Descent Learning o Hybrid Learning
195. 195. • This is a least-squares problem. Assume fixed radial basis functions, e.g. Gaussian functions. • The centers are chosen randomly. The function is normalized, i.e. for any x, ∑φi = 1. • The standard deviation (width) of the radial function is determined by an ad hoc choice.
196. 196. 1. The width is fixed according to the spread of the centers: σ = d / √(2h), where h is the number of centers and d is the maximum distance between the chosen centers.
197. 197. 2. Calculate the outputs generated: Φ = [φ1, φ2, …, φh], w = [w1, w2, …, wh]ᵀ, and Φw = yd, where yd is the desired output. 3. The required weight vector is computed as w = Φ′yd = (ΦᵀΦ)⁻¹Φᵀyd, where Φ′ = (ΦᵀΦ)⁻¹Φᵀ is the pseudo-inverse of Φ. This is possible only when ΦᵀΦ is non-singular; if it is singular, singular value decomposition is used to solve for w.
198. 198. E.g. the EX-NOR problem. The truth table and the RBFN architecture are given below: (0,0) → 1, (0,1) → 0, (1,0) → 0, (1,1) → 1. The choice of centers is made randomly from the 4 input patterns.
199. 199. Output: y = w1φ1 + w2φ2 + θ. What do we get on applying the 4 training patterns?
Pattern 1: w1 + w2 e⁻² + θ
Pattern 2: w1 e⁻¹ + w2 e⁻¹ + θ
Pattern 3: w1 e⁻¹ + w2 e⁻¹ + θ
Pattern 4: w1 e⁻² + w2 + θ
What are the matrices for Φ, w, yd?
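A short sketch answers both questions, building Φ row by row and solving with the pseudo-inverse from step 3. It assumes, as the e⁻¹ and e⁻² terms imply, centers c1 = (0,0) and c2 = (1,1) with φj = exp(-║x - cj║²), and XNOR targets yd = [1, 0, 0, 1]ᵀ:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
yd = np.array([1.0, 0.0, 0.0, 1.0])             # XNOR targets
c = np.array([[0.0, 0.0], [1.0, 1.0]])          # centers picked from the patterns

# phi[p, j] = exp(-||x_p - c_j||^2), giving the e^-1 / e^-2 entries above
phi = np.exp(-((X[:, None, :] - c[None, :, :])**2).sum(axis=2))
Phi = np.hstack([phi, np.ones((4, 1))])         # extra column of ones for theta

w = np.linalg.pinv(Phi) @ yd                    # w = (Phi^T Phi)^-1 Phi^T yd
print(w)                                        # [w1, w2, theta]
```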
200. 200. One of the most popular approaches to updating c and w is supervised training with an error-correcting term, achieved by a gradient descent technique. The update rule for center learning moves each center in the direction that reduces the output error.
201. 201. After simplification, the update rule for center learning adjusts each center in proportion to the output error, the weight of that center, and its basis-function activation. The update rule for the linear weights adjusts each weight in proportion to the output error and the corresponding basis-function output.
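The slide's exact equations were lost in extraction; the sketch below shows the standard gradient-descent updates for a Gaussian RBFN with instantaneous error E = ½(yd - y)², which is the usual form these rules take (treat it as an assumption, not the slide's literal formulas):

```python
import numpy as np

def rbf_grad_step(x, yd, c, w, sigma, eta_w=0.1, eta_c=0.01):
    # c has one row per center; x is a single input vector
    diff = x[None, :] - c                                # x - c_j for each center
    phi = np.exp(-(diff**2).sum(axis=1) / (2 * sigma**2))
    e = yd - w @ phi                                     # output error yd - y
    w_new = w + eta_w * e * phi                          # linear weight update
    # Center update: gradient of E w.r.t. c_j for a Gaussian basis function
    c_new = c + eta_c * e * (w * phi)[:, None] * diff / sigma**2
    return c_new, w_new
```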
202. 202. Some application areas of RNNs: • control of chemical plants • control of engines and generators • fault monitoring, biomedical diagnostics and monitoring • speech recognition • robotics, toys and edutainment • video data analysis • man-machine interfaces
203. 203. • There is a need for systems which can process time-dependent data, • especially for applications (like weather forecasting) which involve prediction based on the past.
204. 204. • Feed-forward networks: – Information only flows one way – One input pattern produces one output – No sense of time (or memory of previous state) • Recurrent networks: – Nodes connect back to other nodes or themselves – Information flow is multidirectional – Sense of time and memory of previous state(s) • Biological nervous systems show high levels of recurrence (but feed-forward structures exist too)
205. 205. Depending on the density of feedback connections: • Total recurrent networks (Hopfield model) • Partial recurrent networks – with contextual units (Elman model, Jordan model) – cellular networks (Chua model)
206. 206. What is a Hopfield Network? • According to Wikipedia, a Hopfield net is a form of recurrent artificial neural network invented by John Hopfield. • Hopfield nets serve as content-addressable memory systems with binary threshold units. • They are guaranteed to converge to a local minimum, but convergence to one of the stored patterns is not guaranteed.
207. 207. What are HNs (informally)? • These are single-layer recurrent networks. • Every neuron in the network is fed back by all the other neurons in the network. • The states of the neurons are either +1 or -1 (instead of 1 and 0) in order for the network to work correctly. • The number of input nodes should always be equal to the number of output nodes. [Figure: a Hopfield network with four nodes.]
208. 208. • Recalling or reconstructing corrupted patterns • Large-scale computational intelligence systems • Handwriting recognition software • Practical applications of HNs are limited because the number of training patterns can be at most about 14% of the number of nodes in the network. • If the network is overloaded, i.e. trained with more than the maximum acceptable number of attractors, then it won't converge to clearly defined attractors.
209. 209. • This network is capable of associating its input with one of the patterns stored in the network's memory. – How are patterns stored in memory? – How are inputs supplied to the network? – What is the topology of the network?
210. 210. • The inputs of the Hopfield network are values x1, …, xN with -1 ≤ xi ≤ 1. • Hence, the vector x = [x1 … xN] represents a point from a hyper-cube. Topology: • Fully interconnected • Recurrent network • Weights are symmetric: wi,j = wj,i
211. 211. [Figure: the i-th neuron receives the outputs y1, y2, …, yi-1, yi+1, …, yN of all the other neurons through the weights wi,1, wi,2, …, wi,N and produces the output yi, limited to the range -1 to 1.]
212. 212. • A neuron is characterized by its state si. • The output of the neuron is a function of the neuron's state: yi = f(si). • The applied function f is a soft limiter which effectively limits the output to the [-1, 1] range. • Neuron initialization: when an input vector x arrives at the network, the state of the i-th neuron, i = 1, …, N, is initialized by the value of the i-th input: si = xi.
213. 213. • Subsequently, while there is any change:
si = Σj≠i wi,j yj
yi = f(si)
• The output of the network is the vector y = [y1 … yN] consisting of the neuron outputs when the network stabilizes.
214. 214. • The subsequent computation of the network will continue until the network stabilizes. • The network will stabilize when all the states of the neurons stay the same. • IMPORTANT PROPERTY: Hopfield's network will ALWAYS stabilize after finite time.
215. 215. • Assume that we want to memorize M different N-dimensional vectors x1*, …, xM*. – What does it mean "to memorize"? – It means: if a vector "similar" to one of the memorized vectors is brought to the input of the Hopfield network, the stored vector closest to it will appear at the output of the network.
216. 216. The following can be proven: • If the number M of memorized N-dimensional vectors is smaller than N / (4 ln N), • then we can set the weights of the network as W = Σm=1..M xm* xm*ᵀ - M·I, • where W contains the weights of the network: a symmetric matrix with zeros on the main diagonal, so that NONE of the neurons is connected to itself. • With this choice, the vectors xm* correspond to stable states of the network.
217. 217. • If vector xm* is on the input of the Hopfield network, the same vector xm* will be on its output. • If a vector "close" to vector xm* is on the input of the Hopfield network, the vector xm* will be on its output. Hence, the Hopfield network memorizes by embedding knowledge into its weights.
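A compact sketch of storage and recall (illustrative; a hard sign threshold stands in for the soft limiter described earlier, and asynchronous updates run until no state changes):

```python
import numpy as np

def store(patterns):
    # W = sum_m x_m x_m^T - M*I: symmetric, zero diagonal
    M, N = patterns.shape
    return patterns.T @ patterns - M * np.eye(N)

def recall(W, x, max_sweeps=100):
    y = x.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):          # update one neuron at a time
            s = W[i] @ y                 # s_i = sum_{j != i} w_ij y_j (diagonal is 0)
            new = 1.0 if s >= 0 else -1.0
            if new != y[i]:
                y[i], changed = new, True
        if not changed:                  # the network has stabilized
            return y
    return y

patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]], dtype=float)
W = store(patterns)
noisy = patterns[0].copy()
noisy[0] = -noisy[0]                     # corrupt one component
print(recall(W, noisy))                  # typically recovers patterns[0]
```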
218. 218. • What is "close"? – The output associated to an input is one of the stored vectors "closest" to the input. – However, the notion of "closeness" is hard-encoded in the weight matrix and we cannot influence it. • Spurious states: – Assume that we memorized M different patterns in a Hopfield network. – The network may have more than M stable states. – Hence the output may be NONE of the vectors that are memorized in the network. – In other words: among the offered M choices, the network could not decide.
219. 219. • What if the vectors xm* to be learned are not exact (contain errors)? • In other words: – If we had two patterns representing class 1 and class 2, we could assign each pattern to a vector and learn the vectors. – However, if we had 100 different patterns representing class 1, and 100 patterns representing class 2, we cannot assign one vector to each pattern.
220. 220. [Figure: three-neuron recurrent network with outputs Oa, Ob, Oc and weights W1,1, W1,2, W1,3, W2,1, W2,2, W2,3, W3,1, W3,2, W3,3.] There are various ways to train these kinds of networks, like the back propagation algorithm, recurrent learning algorithms and genetic algorithms. But there is one very simple algorithm to train these simple networks, called the 'one-shot method'.
221. 221. The method consists of a single calculation for each weight (so the whole network can be trained in "one pass"). The inputs are -1 and +1 (the neuron threshold is zero). Let's train this network on the following patterns:
• Pattern 1: Oa(1) = -1, Ob(1) = -1, Oc(1) = 1
• Pattern 2: Oa(2) = 1, Ob(2) = -1, Oc(2) = -1
• Pattern 3: Oa(3) = -1, Ob(3) = 1, Oc(3) = 1
If you want to imagine each pattern as an image, then the -1 might represent a white pixel and the +1 a black one.
222. 222. The training is now simple. • We multiply the pixels in each pattern corresponding to the indices of the weight; so for W1,2 we multiply the values of pixel 1 and pixel 2 together in each of the patterns we wish to train. • We then add up the results.
223. 223.
• Pattern 1: Oa(1) = -1, Ob(1) = -1, Oc(1) = 1
• Pattern 2: Oa(2) = 1, Ob(2) = -1, Oc(2) = -1
• Pattern 3: Oa(3) = -1, Ob(3) = 1, Oc(3) = 1
w1,1 = 0
w1,2 = OA(1) × OB(1) + OA(2) × OB(2) + OA(3) × OB(3) = (-1) × (-1) + 1 × (-1) + (-1) × 1 = -1
w1,3 = OA(1) × OC(1) + OA(2) × OC(2) + OA(3) × OC(3) = (-1) × 1 + 1 × (-1) + (-1) × 1 = -3
w2,2 = 0
w2,1 = OB(1) × OA(1) + OB(2) × OA(2) + OB(3) × OA(3) = (-1) × (-1) + (-1) × 1 + 1 × (-1) = -1
w2,3 = OB(1) × OC(1) + OB(2) × OC(2) + OB(3) × OC(3) = (-1) × 1 + (-1) × (-1) + 1 × 1 = 1
w3,3 = 0
w3,1 = OC(1) × OA(1) + OC(2) × OA(2) + OC(3) × OA(3) = 1 × (-1) + (-1) × 1 + 1 × (-1) = -3
w3,2 = OC(1) × OB(1) + OC(2) × OB(2) + OC(3) × OB(3) = 1 × (-1) + (-1) × (-1) + 1 × 1 = 1
(Note that w1,2 = w2,1 = -1: the products are the same in either order, so the weight matrix stays symmetric.)
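The same calculation in a few lines of NumPy (a sketch; the weight matrix is simply the sum of outer products of the patterns, with the self-weights on the diagonal zeroed):

```python
import numpy as np

P = np.array([[-1, -1,  1],    # pattern 1: Oa, Ob, Oc
              [ 1, -1, -1],    # pattern 2
              [-1,  1,  1]])   # pattern 3

W = P.T @ P                    # w_ij = sum over patterns of O_i * O_j
np.fill_diagonal(W, 0)         # no self-connections: w_ii = 0
print(W)
# [[ 0 -1 -3]
#  [-1  0  1]
#  [-3  1  0]]
```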
224. 224. Train this network with the three patterns shown. Answer:
w1,1 = 0, w1,2 = -3, w1,3 = 1
w2,2 = 0, w2,1 = -3, w2,3 = -1
w3,3 = 0, w3,1 = 1, w3,2 = -1
225. 225. "If the brain were so simple that we could understand it, then we'd be so simple that we couldn't." – Lyall Watson
