Data Mining - Part 3: Neural Networks (handout)

  1. Data Mining - Part 3: Neural Networks. Christof Monz, Informatics Institute, University of Amsterdam.
     Overview: perceptrons; gradient descent search; multi-layer neural networks; the backpropagation algorithm.
  2. Neural Networks: Analogy to biological neural systems, the most robust learning systems we know. Attempt to understand natural biological systems through computational modeling. Massive parallelism allows for computational efficiency. Helps understand the 'distributed' nature of neural representations. Intelligent behavior as an 'emergent' property of a large number of simple units rather than of explicitly encoded symbolic rules and algorithms.
     Neural Network Learning: Learning approach based on modeling adaptation in biological neural systems. Perceptron: initial algorithm for learning simple (single-layer) neural networks, developed in the 1950s. Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s.
  3. Real Neurons (figure). Human Neural Network (figure).
  4. Modeling Neural Networks (figure). Perceptrons (figure).
  5. Perceptrons: A perceptron is a single-layer neural network with one output unit. The output of a perceptron is computed as o(x1, ..., xn) = 1 if w0 + w1x1 + ... + wnxn > 0, and −1 otherwise. Assuming a 'dummy' input x0 = 1, we can write o(x1, ..., xn) = 1 if ∑_{i=0}^{n} wi xi > 0, and −1 otherwise.
     Learning a perceptron involves choosing the 'right' values for the weights w0, ..., wn. The set of candidate hypotheses is H = {w | w ∈ ℜ^(n+1)}.
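As an illustration of the output rule above, here is a minimal Python sketch; the function name and the example AND weights are illustrative choices, not taken from the handout.

```python
def perceptron_output(weights, inputs):
    """Threshold unit: +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.

    weights = [w0, w1, ..., wn]; inputs = [x1, ..., xn].
    The 'dummy' input x0 = 1 is prepended so the bias weight w0
    is treated like every other weight, as on the slide.
    """
    xs = [1.0] + list(inputs)                        # x0 = 1
    activation = sum(w * x for w, x in zip(weights, xs))
    return 1 if activation > 0 else -1

# Example: weights realizing boolean AND over inputs in {0, 1}
# (w0 = -1.5, w1 = w2 = 1): the output is +1 only for input (1, 1).
print(perceptron_output([-1.5, 1.0, 1.0], [1, 1]))   # +1
print(perceptron_output([-1.5, 1.0, 1.0], [1, 0]))   # -1
```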
  6. Representational Power: A single perceptron can represent many boolean functions, e.g. AND, OR, NAND (¬AND), ..., but not all (e.g. XOR).
     Perceptron Training Rule: The perceptron training rule can be defined for each weight as wi ← wi + ∆wi, where ∆wi = η(t − o)xi, t is the target output, o is the output of the perceptron, and η is the learning rate. This scenario assumes that we know what the target outputs are supposed to be.
  7. Perceptron Training Rule Example: If t = o, then η(t − o)xi = 0 and ∆wi = 0, i.e. the weight wi remains unchanged, regardless of the learning rate and the input value xi. Let's assume a learning rate of η = 0.1 and an input value of xi = 0.8: if t = +1 and o = −1, then ∆wi = 0.1(1 − (−1))0.8 = 0.16; if t = −1 and o = +1, then ∆wi = 0.1(−1 − 1)0.8 = −0.16.
     The perceptron training rule converges after a finite number of iterations. The stopping criterion holds if the amount of change falls below a pre-defined threshold θ, e.g. if |∆w|_L1 < θ. But it converges only if the training examples are linearly separable.
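The numeric example above translates directly into code. The sketch below is a hedged illustration of the single-weight update ∆wi = η(t − o)xi; the function name and the zero initial weight are made up for the example.

```python
def perceptron_update(w_i, x_i, t, o, eta=0.1):
    """Return the updated weight and the weight change for one input."""
    delta = eta * (t - o) * x_i
    return w_i + delta, delta

# Reproduces the numbers from the slide (eta = 0.1, xi = 0.8):
_, d1 = perceptron_update(0.0, 0.8, t=+1, o=-1)   # d1 = 0.1 * 2 * 0.8 = 0.16
_, d2 = perceptron_update(0.0, 0.8, t=-1, o=+1)   # d2 = 0.1 * (-2) * 0.8 = -0.16
print(d1, d2)
```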
  8. The Delta Rule: The delta rule overcomes the shortcoming that the perceptron training rule is not guaranteed to converge if the examples are not linearly separable. The delta rule is based on gradient descent search. Assume an unthresholded perceptron: o(x) = w · x. We can define the training error as E(w) = 1/2 ∑_{d∈D} (td − od)^2, where D is the set of training examples.
     Error Surface (figure).
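To make the error definition concrete, here is a small sketch of the unthresholded output o(x) = w · x and the squared training error E(w); the helper names and the two-example toy training set are invented for illustration.

```python
def linear_output(w, x):
    """Unthresholded perceptron output o(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def training_error(w, data):
    """E(w) = 1/2 * sum over d in D of (t_d - o_d)^2.

    data is a list of (x, t) pairs; each x already includes the dummy x0 = 1.
    """
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in data)

D = [([1.0, 0.5], 1.0), ([1.0, -0.5], -1.0)]   # toy training set
print(training_error([0.0, 1.0], D))           # 0.25 for these made-up values
```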
  9. Gradient Descent: The gradient of E is the vector pointing in the direction of the steepest increase at any point on the error surface: ∇E(w) = (∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn). Since we are interested in minimizing the error, we consider the negative gradient −∇E(w). The training rule for gradient descent is w ← w + ∆w, where ∆w = −η∇E(w).
     The training rule for individual weights is defined as wi ← wi + ∆wi, where ∆wi = −η ∂E/∂wi. Instantiating E with the error function we use gives ∂E/∂wi = ∂/∂wi [1/2 ∑_{d∈D} (td − od)^2]. How do we use partial derivatives to actually compute updates to the weights at each step?
  10. Gradient Descent: ∂E/∂wi = ∂/∂wi [1/2 ∑_{d∈D} (td − od)^2] = 1/2 ∑_{d∈D} ∂/∂wi (td − od)^2 = 1/2 ∑_{d∈D} 2(td − od) ∂/∂wi (td − od) = ∑_{d∈D} (td − od)(−xid).
      The delta rule for individual weights can now be written as wi ← wi + ∆wi, where ∆wi = η ∑_{d∈D} (td − od)xid. The gradient descent algorithm picks initial random weights, computes the outputs, updates each weight by adding ∆wi, and repeats until convergence.
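The end result of the derivation, ∆wi = η ∑_{d∈D} (td − od)xid, amounts to a one-line batch update per weight. The sketch below assumes the same unthresholded unit and toy data as before; all names are illustrative.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def delta_w(w, data, i, eta=0.05):
    """Weight change for weight i, summed over the whole training set D."""
    return eta * sum((t - dot(w, x)) * x[i] for x, t in data)

D = [([1.0, 0.5], 1.0), ([1.0, -0.5], -1.0)]   # each x includes the dummy x0 = 1
w = [0.0, 0.0]
print([delta_w(w, D, i) for i in range(len(w))])
```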
  11. The Gradient Descent Algorithm: Each training example is a pair <x, t>.
      1 Initialize each wi to some small random value
      2 Until the termination condition is met do:
        2.1 Initialize each ∆wi to 0
        2.2 For each <x, t> ∈ D do:
          2.2.1 Compute o(x)
          2.2.2 For each weight wi do ∆wi ← ∆wi + η(t − o)xi
        2.3 For each weight wi do wi ← wi + ∆wi
      The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough. If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum. It is a common strategy to gradually decrease the learning rate. The algorithm also works when the training examples are not linearly separable.
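A minimal, hedged implementation of the batch algorithm above for an unthresholded unit o(x) = w · x. The fixed number of epochs stands in for the unspecified termination condition, and the learning rate and initialization range are illustrative choices.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_gradient_descent(data, n_weights, eta=0.05, epochs=100):
    """data: list of (x, t) pairs, each x including the dummy input x0 = 1."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]   # step 1
    for _ in range(epochs):                                       # step 2 (fixed-epoch stop)
        dw = [0.0] * n_weights                                    # step 2.1
        for x, t in data:                                         # step 2.2
            o = dot(w, x)                                         # step 2.2.1
            for i in range(n_weights):                            # step 2.2.2
                dw[i] += eta * (t - o) * x[i]
        for i in range(n_weights):                                # step 2.3
            w[i] += dw[i]
    return w
```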
  12. Shortcomings of Gradient Descent: Converging to a minimum can be quite slow (it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima. If there are multiple local minima in the error surface, gradient descent can get stuck in one of them and fail to find the global minimum. Stochastic gradient descent alleviates these difficulties.
      Stochastic Gradient Descent: Gradient descent updates the weights after summing over all training examples. Stochastic (or incremental) gradient descent updates the weights incrementally, after calculating the error for each individual training example. To this end, step 2.3 is deleted and step 2.2.2 is modified.
  13. Stochastic Gradient Descent: Each training example is a pair <x, t>.
      1 Initialize each wi to some small random value
      2 Until the termination condition is met do:
        2.1 Initialize each ∆wi to 0
        2.2 For each <x, t> ∈ D do:
          2.2.1 Compute o(x)
          2.2.2 For each weight wi do wi ← wi + η(t − o)xi
      Comparison: In standard gradient descent, summing over multiple examples requires more computation per weight update step. As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent. Stochastic gradient descent can avoid falling into local minima because it uses the individual gradients ∇Ed(w), rather than the overall ∇E(w), to guide its search.
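For comparison with the batch version, here is the same sketch with the stochastic modification: each weight is updated immediately for every training example (modified step 2.2.2), and the accumulation step 2.3 disappears. As before, the epoch count and learning rate are illustrative.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_stochastic_gd(data, n_weights, eta=0.05, epochs=100):
    """data: list of (x, t) pairs, each x including the dummy input x0 = 1."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]   # step 1
    for _ in range(epochs):                                       # step 2
        for x, t in data:                                         # step 2.2
            o = dot(w, x)                                         # step 2.2.1
            for i in range(n_weights):                            # step 2.2.2 (modified)
                w[i] += eta * (t - o) * x[i]
    return w
```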
  14. Multi-Layer Neural Networks: Perceptrons only have two layers, the input layer and the output layer, and only one output unit; they are limited in their expressiveness. Multi-layer neural networks consist of an input layer, a hidden layer, and an output layer, and can have several output units.
      Multi-Layer Neural Networks (figure).
  15. Multi-Layer Neural Networks: The units of the hidden layer function as input units to the next layer. However, multiple layers of linear units still produce only linear functions. The step function used in perceptrons is another choice, but it is not differentiable and therefore not suitable for gradient descent search. Solution: the sigmoid function, a non-linear, differentiable threshold function.
      Sigmoid Unit (figure).
  16. The Sigmoid Function: The output is computed as o = σ(w · x), where σ(y) = 1/(1 + e^(−y)), i.e. o = σ(w · x) = 1/(1 + e^(−w·x)). Another nice property of the sigmoid function is that its derivative is easily expressed: dσ(y)/dy = σ(y)(1 − σ(y)).
      Learning with Multiple Layers: Gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted. Firstly, there can be multiple output units, so the error function has to be generalized: E(w) = 1/2 ∑_{d∈D} ∑_{k∈outputs} (tkd − okd)^2. Secondly, the error 'feedback' has to be fed back through multiple layers.
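The sigmoid and its derivative, as given on the slide, in a few lines of Python; the derivative identity σ'(y) = σ(y)(1 − σ(y)) is exactly what the backpropagation error terms on the next slide rely on.

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))"""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """d sigma(y) / dy = sigma(y) * (1 - sigma(y))"""
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5 0.25
```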
  17. Backpropagation Algorithm: For each training example <x, t> do:
      1 Input x to the network and compute the output ou for every unit u in the network
      2 For each output unit k calculate its error δk: δk ← ok(1 − ok)(tk − ok)
      3 For each hidden unit h calculate its error δh: δh ← oh(1 − oh) ∑_{k∈outputs} wkh δk
      4 Update each network weight wji: wji ← wji + ∆wji, where ∆wji = η δj xji
      Note: xji is the input value from unit i to unit j, and wji is the weight of the connection from unit i to unit j.
      Step 1 propagates the input forward through the network; steps 2–4 propagate the errors backward through the network. Step 2 is similar to the delta rule in gradient descent (step 2.3). Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units).
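Below is a compact, hedged sketch of one backpropagation step for a single hidden layer of sigmoid units, following steps 1–4 above. The weight layout (one list of weights per unit, with a trailing bias weight fed by a constant input of 1), the learning rate, and the function names are illustrative choices, not prescribed by the handout.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(x, t, w_hidden, w_out, eta=0.1):
    """One training example <x, t>; updates the weight lists in place."""
    x = list(x) + [1.0]                                   # dummy input for the bias
    # Step 1: forward pass through the hidden and output layers
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    h_in = h + [1.0]                                      # hidden outputs + bias input
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h_in))) for row in w_out]
    # Step 2: output errors, delta_k = o_k (1 - o_k)(t_k - o_k)
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Step 3: hidden errors, delta_h = o_h (1 - o_h) * sum_k w_kh delta_k
    delta_h = [h[j] * (1 - h[j]) *
               sum(w_out[k][j] * delta_o[k] for k in range(len(w_out)))
               for j in range(len(h))]
    # Step 4: weight updates, w_ji <- w_ji + eta * delta_j * x_ji
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * delta_o[k] * h_in[i]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * x[i]
    return o
```

Repeating this step over the training set, either for a fixed number of epochs or until the error stops decreasing, gives the full training loop.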
  18. Applications of Neural Networks: text to speech, fraud detection, automated vehicles, game playing, handwriting recognition.
      Summary: perceptrons (simple one-layer neural networks); the perceptron training rule; gradient descent search; multi-layer neural networks; the backpropagation algorithm.
