Data Mining - Part 3: Neural Networks
Christof Monz, Informatics Institute, University of Amsterdam

Overview
  • Perceptrons
  • Gradient descent search
  • Multi-layer neural networks
  • The backpropagation algorithm

Neural Networks
  • Analogy to biological neural systems, the most robust learning systems we know
  • Attempt to understand natural biological systems through computational modeling
  • Massive parallelism allows for computational efficiency
  • Help understand the 'distributed' nature of neural representations
  • Intelligent behavior as an 'emergent' property of a large number of simple units rather than from explicitly encoded symbolic rules and algorithms

Neural Network Learning
  • Learning approach based on modeling adaptation in biological neural systems
  • Perceptron: initial algorithm for learning simple (single-layer) neural networks, developed in the 1950s
  • Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s

Real Neurons (figure)

Human Neural Network (figure)

Modeling Neural Networks (figure)

Perceptrons
  • A perceptron is a single-layer neural network with one output unit
  • The output of a perceptron is computed as follows:
    o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and −1 otherwise
  • Assuming a 'dummy' input x_0 = 1, we can write:
    o(x_1, ..., x_n) = 1 if ∑_{i=0}^{n} w_i x_i > 0, and −1 otherwise
  • Learning a perceptron involves choosing the 'right' values for the weights w_0 ... w_n
  • The set of candidate hypotheses is H = { w | w ∈ ℝ^{n+1} }

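As a concrete illustration of the thresholded output above, here is a minimal Python sketch (the function name and the use of NumPy are my own assumptions, not part of the handout):

```python
import numpy as np

def perceptron_output(w, x):
    """Perceptron output: 1 if sum_i w_i x_i > 0 (with dummy x_0 = 1), else -1."""
    x = np.concatenate(([1.0], x))  # prepend the 'dummy' input x_0 = 1
    return 1 if np.dot(w, x) > 0 else -1
```

For example, perceptron_output(np.array([-0.5, 1.0, 1.0]), np.array([1.0, 0.0])) returns 1; with these weights the unit behaves like boolean OR over {0, 1} inputs.
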
Representational Power
  • A single perceptron can represent many boolean functions, e.g. AND, OR, NAND (¬AND), ..., but not all (e.g., XOR)

Perceptron Training Rule
  • The perceptron training rule can be defined for each weight as:
    w_i ← w_i + Δw_i   where Δw_i = η(t − o) x_i
    where t is the target output, o is the output of the perceptron, and η is the learning rate
  • This scenario assumes that we know what the target outputs are supposed to be

Perceptron Training Rule Example
  • If t = o then η(t − o) x_i = 0 and Δw_i = 0, i.e. the weight w_i remains unchanged, regardless of the learning rate and the input value x_i
  • Let's assume a learning rate of η = 0.1 and an input value of x_i = 0.8
      • If t = +1 and o = −1, then Δw_i = 0.1 (1 − (−1)) · 0.8 = 0.16
      • If t = −1 and o = +1, then Δw_i = 0.1 (−1 − 1) · 0.8 = −0.16

Perceptron Training Rule
  • The perceptron training rule converges after a finite number of iterations
  • The stopping criterion holds once the amount of change falls below a pre-defined threshold θ, e.g., if ||Δw||_1 < θ
  • However, convergence is guaranteed only if the training examples are linearly separable

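A one-line sketch of the training-rule increment, reproducing the ±0.16 values from the example above (the helper name is hypothetical):

```python
def perceptron_delta(eta, t, o, x_i):
    """Weight change for a single weight: delta_w_i = eta * (t - o) * x_i."""
    return eta * (t - o) * x_i

print(perceptron_delta(0.1, +1, -1, 0.8))  # ≈  0.16
print(perceptron_delta(0.1, -1, +1, 0.8))  # ≈ -0.16
```
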
The Delta Rule
  • The delta rule overcomes the shortcoming of the perceptron training rule, which is not guaranteed to converge if the examples are not linearly separable
  • The delta rule is based on gradient descent search
  • Let's assume we have an unthresholded perceptron: o(x) = w · x
  • We can define the training error as:
    E(w) = (1/2) ∑_{d ∈ D} (t_d − o_d)^2
    where D is the set of training examples

Error Surface (figure)

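As a small sketch, the training error E(w) defined above can be computed for an unthresholded perceptron as follows (NumPy-based; the function and argument names are my own):

```python
import numpy as np

def training_error(w, X, t):
    """E(w) = 1/2 * sum_d (t_d - o_d)^2, with o_d = w . x_d.

    X: (m, n+1) matrix of inputs, each row including the dummy x_0 = 1
    t: (m,) vector of target outputs
    """
    o = X @ w                        # unthresholded outputs o_d = w . x_d
    return 0.5 * np.sum((t - o) ** 2)
```
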
Gradient Descent
  • The gradient of E is the vector pointing in the direction of the steepest increase for any point on the error surface:
    ∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n ]
  • Since we are interested in minimizing the error, we consider negative gradients: −∇E(w)
  • The training rule for gradient descent is:
    w ← w + Δw   where Δw = −η ∇E(w)
  • The training rule for individual weights is defined as:
    w_i ← w_i + Δw_i   where Δw_i = −η ∂E/∂w_i
  • Instantiating E for the error function we use gives:
    ∂E/∂w_i = ∂/∂w_i (1/2) ∑_{d ∈ D} (t_d − o_d)^2
  • How do we use partial derivatives to actually compute updates to weights at each step?

Gradient Descent
  • Working out the partial derivative:
    ∂E/∂w_i = ∂/∂w_i (1/2) ∑_{d ∈ D} (t_d − o_d)^2
            = (1/2) ∑_{d ∈ D} ∂/∂w_i (t_d − o_d)^2
            = (1/2) ∑_{d ∈ D} 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
            = ∑_{d ∈ D} (t_d − o_d) ∂/∂w_i (t_d − o_d)
            = ∑_{d ∈ D} (t_d − o_d) · (−x_{id})
    where the last step uses o_d = w · x_d, so that ∂(t_d − o_d)/∂w_i = −x_{id}
  • The delta rule for individual weights can now be written as:
    w_i ← w_i + Δw_i   where Δw_i = η ∑_{d ∈ D} (t_d − o_d) x_{id}
  • The gradient descent algorithm
      • picks initial random weights
      • computes the outputs
      • updates each weight by adding Δw_i
      • repeats until convergence

The Gradient Descent Algorithm
  • Each training example is a pair <x, t>
    1. Initialize each w_i to some small random value
    2. Until the termination condition is met do:
       2.1 Initialize each Δw_i to 0
       2.2 For each <x, t> ∈ D do
           2.2.1 Compute o(x)
           2.2.2 For each weight w_i do: Δw_i ← Δw_i + η(t − o) x_i
       2.3 For each weight w_i do: w_i ← w_i + Δw_i
  • The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough
  • If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum
  • It is a common strategy to gradually decrease the learning rate
  • This algorithm also works if the training examples are not linearly separable

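A minimal Python sketch of the batch algorithm above, for the unthresholded perceptron o(x) = w · x; using a fixed number of epochs as the termination condition, the learning-rate default, and the weight initialization range are my own assumptions:

```python
import numpy as np

def gradient_descent(X, t, eta=0.05, epochs=100, seed=0):
    """Batch gradient descent for an unthresholded perceptron o(x) = w . x.

    X: (m, n+1) matrix of inputs, each row already including the dummy x_0 = 1
    t: (m,) vector of target outputs
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.05, 0.05, X.shape[1])   # step 1: small random weights
    for _ in range(epochs):                    # step 2: until termination
        delta_w = np.zeros_like(w)             # step 2.1
        for x, target in zip(X, t):            # step 2.2
            o = np.dot(w, x)                   # step 2.2.1
            delta_w += eta * (target - o) * x  # step 2.2.2
        w += delta_w                           # step 2.3
    return w
```
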
Shortcomings of Gradient Descent
  • Converging to a minimum can be quite slow (i.e. it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima
  • If there are multiple local minima in the error surface, gradient descent can get stuck in one of them and fail to find the global minimum
  • Stochastic gradient descent alleviates these difficulties

Stochastic Gradient Descent
  • Gradient descent updates the weights after summing over all training examples
  • Stochastic (or incremental) gradient descent updates the weights incrementally, after calculating the error for each individual training example
  • To this end, step 2.3 is deleted and step 2.2.2 is modified

Stochastic Gradient Descent
  • Each training example is a pair <x, t>
    1. Initialize each w_i to some small random value
    2. Until the termination condition is met do:
       2.1 Initialize each Δw_i to 0
       2.2 For each <x, t> ∈ D do
           2.2.1 Compute o(x)
           2.2.2 For each weight w_i do: w_i ← w_i + η(t − o) x_i

Comparison
  • In standard gradient descent, summing over multiple examples requires more computation per weight update step
  • As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent
  • Stochastic gradient descent can avoid falling into local minima because it uses the individual ∇E_d(w) rather than the overall ∇E(w) to guide its search

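For comparison, a sketch of the stochastic variant under the same assumptions as the batch sketch above; the only change is that the weights are updated immediately inside the inner loop (step 2.3 removed, step 2.2.2 modified):

```python
import numpy as np

def stochastic_gradient_descent(X, t, eta=0.05, epochs=100, seed=0):
    """Stochastic (incremental) gradient descent for o(x) = w . x."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.05, 0.05, X.shape[1])  # step 1: small random weights
    for _ in range(epochs):                   # step 2: until termination
        for x, target in zip(X, t):           # step 2.2
            o = np.dot(w, x)                  # step 2.2.1
            w += eta * (target - o) * x       # step 2.2.2: update immediately
    return w
```
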
Multi-Layer Neural Networks
  • Perceptrons only have two layers: the input layer and the output layer
  • Perceptrons only have one output unit
  • Perceptrons are limited in their expressiveness
  • Multi-layer neural networks consist of an input layer, a hidden layer, and an output layer
  • Multi-layer neural networks can have several output units

Multi-Layer Neural Networks (figure)

Multi-Layer Neural Networks
  • The units of the hidden layer function as input units to the next layer
  • However, multiple layers of linear units still produce only linear functions
  • The step function in perceptrons is another choice, but it is not differentiable, and therefore not suitable for gradient descent search
  • Solution: the sigmoid function, a non-linear, differentiable threshold function

Sigmoid Unit (figure)

The Sigmoid Function
  • The output is computed as o = σ(w · x), where σ(y) = 1 / (1 + e^{−y}),
    i.e. o = σ(w · x) = 1 / (1 + e^{−(w · x)})
  • Another nice property of the sigmoid function is that its derivative is easily expressed:
    dσ(y)/dy = σ(y) · (1 − σ(y))

Learning with Multiple Layers
  • Gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted
  • Firstly, there can be multiple output units, and therefore the error function has to be generalized:
    E(w) = (1/2) ∑_{d ∈ D} ∑_{k ∈ outputs} (t_{kd} − o_{kd})^2
  • Secondly, the error 'feedback' has to be fed through multiple layers

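A small sketch of the sigmoid and its derivative as defined above (function names are my own):

```python
import numpy as np

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_derivative(y):
    """d sigma(y)/dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)
```
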
Backpropagation Algorithm
  • For each training example <x, t> do:
    1. Input x to the network and compute the output o_u for every unit u in the network
    2. For each output unit k calculate its error δ_k:
       δ_k ← o_k (1 − o_k)(t_k − o_k)
    3. For each hidden unit h calculate its error δ_h:
       δ_h ← o_h (1 − o_h) ∑_{k ∈ outputs} w_{kh} δ_k
    4. Update each network weight w_{ji}:
       w_{ji} ← w_{ji} + Δw_{ji}   where Δw_{ji} = η δ_j x_{ji}
  • Note: x_{ji} is the value passed from unit i to unit j, and w_{ji} is the weight of the connection from unit i to unit j
  • Step 1 propagates the input forward through the network
  • Steps 2–4 propagate the errors backward through the network
  • Step 2 is similar to the delta rule in gradient descent (step 2.3)
  • Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units)

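A minimal sketch of one backpropagation update for a network with a single hidden layer of sigmoid units, following steps 1–4 above; the matrix shapes, the omission of bias weights, and all variable names are my own simplifying assumptions:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(W_hidden, W_output, x, t, eta=0.1):
    """One backpropagation update.

    W_hidden: (n_hidden, n_in) weights from inputs to hidden units
    W_output: (n_out, n_hidden) weights from hidden to output units
    x: (n_in,) input vector, t: (n_out,) target vector
    """
    # Step 1: forward pass, compute o_u for every unit
    o_hidden = sigmoid(W_hidden @ x)
    o_output = sigmoid(W_output @ o_hidden)
    # Step 2: output-unit errors, delta_k = o_k (1 - o_k)(t_k - o_k)
    delta_out = o_output * (1.0 - o_output) * (t - o_output)
    # Step 3: hidden-unit errors, delta_h = o_h (1 - o_h) sum_k w_kh delta_k
    delta_hidden = o_hidden * (1.0 - o_hidden) * (W_output.T @ delta_out)
    # Step 4: weight updates, delta_w_ji = eta * delta_j * x_ji
    W_output += eta * np.outer(delta_out, o_hidden)
    W_hidden += eta * np.outer(delta_hidden, x)
    return W_hidden, W_output
```

In this sketch W_output[k, h] plays the role of w_{kh}, so W_output.T @ delta_out realizes the sum over output-unit errors in step 3.
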
Applications of Neural Networks
  • Text-to-speech
  • Fraud detection
  • Automated vehicles
  • Game playing
  • Handwriting recognition

Summary
  • Perceptrons: simple one-layer neural networks
  • Perceptron training rule
  • Gradient descent search
  • Multi-layer neural networks
  • Backpropagation algorithm