
Goodfellow, Bengio, Courville (2016) "Deep Learning", Chap. 6


  1. Deep Feedforward Networks. Goodfellow, Bengio, & Courville (2016) Deep Learning, Chap. 6. Shigeru ONO (Insight Factory), DL reading group, 2020/07
  2. TOC
     1. 6.1 Example: Learning XOR
     2. 6.2 Gradient-Based Learning
     3. 6.3 Hidden Units
     4. 6.4 Architecture Design
     5. 6.5 Back-Propagation and Other Differentiation Algorithms
     6. 6.6 Historical Notes
  3. (introduction)
     - deep feedforward network, aka feedforward neural network or multilayer perceptron (MLP)
     - purpose: to approximate some function f*
     - no feedback connections
     - Why is it called a "network"? It is represented by composing many different functions: 1st layer, 2nd layer, ..., output layer
     - Why is it called "neural"? Hidden layers are vector-to-vector functions. We can think of each layer as consisting of many units (vector-to-scalar functions) that act in parallel; each unit resembles a neuron
  4. (introduction)
     - To extend linear models, we can apply the linear model to a nonlinearly transformed input ϕ(x).
     - How to choose ϕ?
       - use a generic ϕ, e.g. the infinite-dimensional ϕ implicit in kernel machines
       - manually engineer ϕ
       - learn ϕ ... the strategy of DL
  5. 6.1 Example: Learning XOR
     - Target function f*: the XOR function
     - Training set: X = {[0,0]⊤, [0,1]⊤, [1,0]⊤, [1,1]⊤}
     - Model: f(x; θ)
     - Loss function: MSE, J(θ) = (1/4) ∑_{x∈X} (f*(x) − f(x; θ))²
     - A linear model cannot represent the XOR function.
     - So: learn a different feature space, in which a linear model can represent the solution.
  6. 6.1 Example: Learning XOR
     - We introduce a simple feedforward network: f(x; W, c, w, b) = f⁽²⁾(f⁽¹⁾(x; W, c); w, b)
     - In most neural networks, f⁽¹⁾ applies an affine transformation controlled by learned parameters, followed by a nonlinearity: f⁽¹⁾(x; W, c) = g(W⊤x + c)
     - g is typically an element-wise function.
     - In modern neural networks, the default recommendation for g is the rectified linear unit (ReLU): g(z) = max{0, z}
     - Our complete network: f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b
  7. 6.1 Example: Learning XOR (figure omitted)
  8. 6.1 Example: Learning XOR
     - Let X be the design matrix whose rows are the four inputs: [0,0], [0,1], [1,0], [1,1]
     - Set W = [[1,1],[1,1]], c = [0,−1]⊤, w = [1,−2]⊤
     - Then the rows of max{0, XW + c⊤} are [0,0], [1,0], [1,0], [2,1]
     - Multiplying by w, we get the correct answers [0, 1, 1, 0]⊤.
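As a supplement (not on the slides), a minimal NumPy check of this worked example, using the same parameter values:

```python
import numpy as np

# Design matrix: the four XOR inputs as rows.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Parameters from the worked example above.
W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

H = np.maximum(0.0, X @ W + c)   # hidden layer: ReLU(XW + c), one row per example
y_hat = H @ w + b                # linear output layer

print(H)       # [[0,0],[1,0],[1,0],[2,1]]
print(y_hat)   # [0. 1. 1. 0.]  -> the XOR targets
```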
  9. 6.2 Gradient-Based Learning
     - The largest difference between linear models and NNs is that most interesting loss functions for NNs are nonconvex.
     - NNs are therefore usually trained by iterative gradient-based optimizers (rather than the linear equation solvers or convex optimizers used for linear models):
       - convergence is not guaranteed
       - the result is sensitive to the initial parameter values
  10. 6.2.1 Cost Functions: 1. learning the conditional distribution
      - Most modern NN models define p(y|x; θ) and simply use the maximum likelihood (ML) principle.
      - The cost function is then the negative log-likelihood (equivalently, the cross-entropy between the training data and the model distribution): J(θ) = −E_{x,y∼p̂_data} log p_model(y|x)
      - Advantage: specifying a model p(y|x) automatically determines the cost function log p(y|x).
  11. 6.2.1 Cost Functions: 2. learning conditional statistics
      - In some cases we want to learn just one conditional statistic of y given x.
      - We can then view the cost function as a functional (a mapping from functions to real numbers) rather than a function.
      - For example, when we wish to predict the mean of y, we can design the cost functional so that its minimum lies at the function mapping x to E[y|x].
      - Solving an optimization problem with respect to a function requires calculus of variations (変分法).
  12. 6.2.1 Cost Functions (cont'd)
      - Suppose the optimization problem is f* = argmin_f E_{x,y∼p_data} ||y − f(x)||²
      - If this function lies within the class we optimize over, it yields f*(x) = E_{y∼p_data(y|x)}[y]
      - i.e., if we could train on infinitely many samples, minimizing the MSE cost ||y − f(x)||² would give a function that predicts E[y|x] for each value of x.
      - Similarly, minimizing the MAE cost ||y − f(x)||₁ would give a function that predicts the median of y for each value of x.
      - MSE and MAE often lead to poor results when used with gradient-based optimization. The cross-entropy cost is more popular, even when we do not need to estimate the full distribution p(y|x).
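As a supplement, a small numerical illustration of this point with made-up data: among constant predictions, the one minimizing MSE is the sample mean, while the one minimizing MAE is the sample median.

```python
import numpy as np

# Hypothetical samples of y for a single fixed x.
y = np.array([0.0, 1.0, 1.0, 10.0])

# Evaluate the MSE and MAE costs over a grid of constant predictions c.
c = np.linspace(-5, 15, 2001)
mse = ((y[None, :] - c[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - c[:, None]).mean(axis=1)

print(c[mse.argmin()], y.mean())      # MSE minimizer: 3.0, the mean
print(c[mae.argmin()], np.median(y))  # MAE minimizer: 1.0, the median
```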
  13. 6.2.2 Output Units: 1. linear units
      - output units based on an affine transformation with no nonlinearity: ŷ = W⊤h + b
      - often used to produce the mean of a conditional Gaussian distribution
      - they do not saturate, so they are suitable for gradient-based optimization
  14. 6.2.2 Output Units: 2. sigmoid units
      - ŷ = σ(w⊤h + b), where σ(x) = 1 / (1 + exp(−x)) (the logistic sigmoid function)
      - For binary y, the NN needs to predict P(y = 1|x). The ML approach is to define a Bernoulli distribution conditioned on x.
      - One possibility is to define P(y = 1|x) = max{0, min{1, w⊤h + b}}, but this has zero gradient whenever w⊤h + b lies outside [0, 1].
      - Instead, let P̃(y) be an unnormalized probability of y and assume log P̃(y) = yz (i.e. log P̃(y = 0|z) = 0 and log P̃(y = 1|z) = z). Normalizing, we obtain a Bernoulli distribution: P(y) = exp(yz) / (exp(0) + exp(z)) = σ((2y − 1)z)
  15. 6.2.2 Output Units (cont'd)
      - The loss function for ML is J(θ) = −log σ((2y − 1)z) = ζ((1 − 2y)z), where ζ(x) = log(1 + exp(x)) (the softplus function).
      - It saturates only when (1 − 2y)z is very negative, i.e. when the model already has the right answer.
      - Other loss functions (e.g. MSE) can saturate whenever σ(z) saturates, so the ML loss is preferred.
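A supplementary sketch (not from the slides) of this loss computed in its softplus form; np.logaddexp is used here as one way to evaluate ζ stably for large arguments.

```python
import numpy as np

def softplus(x):
    # zeta(x) = log(1 + exp(x)), computed stably for large |x|
    return np.logaddexp(0.0, x)

def sigmoid_nll(z, y):
    # Negative log-likelihood of a Bernoulli output unit:
    # J = -log sigma((2y - 1) z) = softplus((1 - 2y) z)
    return softplus((1.0 - 2.0 * y) * z)

z = np.array([-20.0, -1.0, 0.0, 3.0, 20.0])   # logits
print(sigmoid_nll(z, y=1.0))  # large when z is very negative, ~0 when z is large
print(sigmoid_nll(z, y=0.0))  # the mirror image
```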
  16. 6.2.2 Output Units: 3. softmax units
      - Now we wish to generalize the sigmoid to the case of a discrete variable with n values. We need to produce a vector ŷ with ŷᵢ = P(y = i|x).
      - Assume we can predict a vector z of unnormalized log probabilities, zᵢ = log P̃(y = i|x). We can then obtain the desired ŷ as ŷᵢ = softmax(z)ᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ)
      - In the ML approach we wish to maximize the log-likelihood log P(y = i|z) = log softmax(z)ᵢ, where log softmax(z)ᵢ = zᵢ − log ∑ⱼ exp(zⱼ)
      - Many objective functions other than the log-likelihood do not work as well with the softmax function.
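A supplementary sketch of a numerically stable log-softmax; subtracting the maximum before exponentiating is a standard trick not spelled out on the slide.

```python
import numpy as np

def log_softmax(z):
    # log softmax(z)_i = z_i - log sum_j exp(z_j),
    # with the max subtracted first for numerical stability
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

z = np.array([1.0, 2.0, 1000.0])       # a naive exp(z) would overflow here
print(log_softmax(z))                   # finite values
print(np.exp(log_softmax(z)).sum())     # ~1.0, a valid probability vector
```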
  17. 6.2.2 Output Units (cont'd)
      - z can be produced as z = W⊤h + b, but this actually overparameterizes the distribution. Alternatively, we can require one of the zᵢ to be fixed. In practice it rarely makes a difference.
      - Origin of the name: "soft" means the function is continuous and differentiable. It would perhaps be better to call it "softargmax".
      - 4. Other output types ... skipped ...
  18. 6.3 Hidden Units
      - ReLU is an excellent default choice.
      - In practice we can disregard whether the activation function is differentiable at every input point.
      - Most hidden units (1) accept a vector x, (2) compute an affine transformation z = W⊤x + b, and (3) apply an element-wise nonlinear function g(z). They are distinguished only by the choice of g(z).
  19. 6.3.1 ReLU and Their Generalizations
      - ReLU (rectified linear units): the activation function is g(z) = max{0, z}
      - typically used after an affine transformation: h = g(W⊤x + b)
      - when initializing, set all elements of b to a small positive value (e.g. 0.1)
      - drawback: ReLU cannot learn on examples for which its activation is zero
  20. 6.3.1 ReLU and Their Generalizations
      - generalization of ReLU: hᵢ = g(z, α)ᵢ = max(0, zᵢ) + αᵢ min(0, zᵢ)
        - absolute value rectification: αᵢ = −1, so g(z) = |z|; used for object recognition
        - leaky ReLU: fixes αᵢ to a small value like 0.01
        - parametric ReLU: treats αᵢ as a learnable parameter
      - maxout units: g(z)ᵢ = max_{j∈G⁽ⁱ⁾} zⱼ, where G⁽ⁱ⁾ is a group of k elements of z
        - maxout units can learn a piecewise linear, convex function
        - each unit is parameterized by k weight vectors
        - because each unit is driven by multiple filters, maxout resists "catastrophic forgetting" (forgetting how to perform a previously trained task)
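A supplementary sketch of the generalized ReLU and of a maxout unit; grouping consecutive elements of z is an assumption made here for simplicity.

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    # alpha = -1 -> absolute value rectification, alpha = 0.01 -> leaky ReLU
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def maxout(z, k):
    # g(z)_i = max over a group of k elements of z
    # (grouping consecutive elements is one simple choice, assumed here)
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 1.0, 3.0])
print(generalized_relu(z, alpha=-1.0))   # [2.  0.5 1.  3. ]  (|z|)
print(generalized_relu(z, alpha=0.01))   # leaky ReLU
print(maxout(z, k=2))                    # [-0.5  3. ]
```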
  21. 6.3.2 Logistic Sigmoid and Hyperbolic Tangent
      - logistic sigmoid activation function: g(z) = σ(z)
        - mainly used prior to the introduction of ReLU
        - saturates across most of its domain, which makes gradient-based learning difficult
        - now used mainly for output units (or in settings other than feedforward networks)
      - hyperbolic tangent activation function: g(z) = tanh(z) = 2σ(2z) − 1
        - mainly used prior to the introduction of ReLU
        - typically performs better than the logistic sigmoid
  22. 6.3.3 Other Hidden Units
      - A wide variety of differentiable functions perform well:
        - identity function (it is acceptable for some layers to be purely linear)
        - softmax function
        - radial basis function: hᵢ = exp(−(1/σᵢ²) ||W:,ᵢ − x||²)
        - softplus function (generally discouraged)
        - hard tanh: g(a) = max(−1, min(1, a))
  23. 6.4 Architecture Design
      - architecture: the overall structure of the network
      - Most NNs are organized into layers:
        - 1st layer: h⁽¹⁾ = g⁽¹⁾(W⁽¹⁾⊤x + b⁽¹⁾)
        - 2nd layer: h⁽²⁾ = g⁽²⁾(W⁽²⁾⊤h⁽¹⁾ + b⁽²⁾)
        - ...
      - Main considerations:
        - depth: number of layers
        - width: number of units in each layer
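A supplementary sketch of this layered structure as a generic forward pass; the shapes and random weights below are hypothetical.

```python
import numpy as np

def forward(x, params, activations):
    # Chain of layers: h = g(W^T h_prev + b), as on the slide above.
    # params is a list of (W, b) pairs; activations is a list of g functions.
    h = x
    for (W, b), g in zip(params, activations):
        h = g(W.T @ h + b)
    return h

relu = lambda z: np.maximum(0.0, z)
identity = lambda z: z

# A depth-2 network with width 3 in the hidden layer (illustrative shapes).
rng = np.random.default_rng(0)
params = [(rng.normal(size=(2, 3)), np.zeros(3)),   # W(1): 2 -> 3
          (rng.normal(size=(3, 1)), np.zeros(1))]   # W(2): 3 -> 1
print(forward(np.array([0.5, -1.0]), params, [relu, identity]))
```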
  24. 6.4.1 Universal Approximation Properties and Depth
      - Universal approximation theorem: if a feedforward network has (1) a linear output layer, (2) at least one hidden layer with any "squashing" activation function, and (3) enough hidden units, then it can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error.
        - "squashing" function: e.g. the logistic sigmoid
        - Borel measurable functions include any continuous function on a closed and bounded subset of Rⁿ
      - In other words: a large enough feedforward network is able to represent any function we are trying to learn.
  25. 6.4.1 Universal Approximation Properties and Depth
      - But "represent" ≠ "learn": the MLP may fail to find the right parameters, or may choose the wrong function.
      - Even if one hidden layer is enough, that layer may be infeasibly large.
      - Towards deeper models:
        - In many circumstances, using deeper models can reduce the number of units required.
        - Statistical reason: choosing a deep model encodes the belief that the learning problem consists of discovering a set of underlying factors, which can in turn be described in terms of simpler underlying factors.
        - Some experiments suggest that deep architectures express a useful prior.
  26. 6.4.2 Other Architectural Considerations ... skipped ...
  27. 6.5 Back-Propagation and Other Differentiation Algorithms
      - forward propagation (順伝播):
        - the input x provides the initial information
        - it propagates up through the hidden layers and finally produces ŷ
        - forward propagation can continue until it produces a scalar cost J(θ)
      - back-propagation algorithm (backprop, 誤差逆伝播法):
        - the cost J(θ) provides the initial information
        - it flows backward through the network in order to compute the gradient
        - a simpler procedure than evaluating the gradient analytically
        - backprop is not the whole learning algorithm, only the method for computing the gradient; another algorithm (e.g. stochastic gradient descent) uses the gradient to perform learning
  28. 6.5.1 Computational Graphs
      - graph representation of a computation:
        - node: a variable
        - edge from x to y: an operation applied to the variable x that computes y
  29. 6.5.1 Computational Graphs (figure omitted)
  30. 6.5.2 Chain Rule of Calculus
      - Let x be a real number, and suppose y = g(x) and z = f(y). Then dz/dx = (dz/dy)(dy/dx)
      - Suppose x ∈ Rᵐ, y ∈ Rⁿ, y = g(x), z = f(y). Then ∂z/∂xᵢ = ∑ⱼ (∂z/∂yⱼ)(∂yⱼ/∂xᵢ)
      - In vector notation, ∇ₓz = (∂y/∂x)⊤ ∇ᵧz, where ∂y/∂x is the n × m Jacobian matrix of g.
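A supplementary numeric check of the vector chain rule, with hypothetical f and g: the hand-written Jacobian-transpose product is compared against a finite-difference approximation.

```python
import numpy as np

# Hypothetical g: R^2 -> R^3 and f: R^3 -> R, chosen just to illustrate the rule.
def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):
    return y[0] + 2.0 * y[1] * y[2]

x = np.array([0.7, -1.3])
y = g(x)

# Jacobian dy/dx (3 x 2) and gradient dz/dy (length 3), written out by hand.
J = np.array([[x[1],          x[0]],
              [np.cos(x[0]),  0.0],
              [0.0,           2.0 * x[1]]])
grad_y = np.array([1.0, 2.0 * y[2], 2.0 * y[1]])

grad_x = J.T @ grad_y   # chain rule: grad_x z = (dy/dx)^T grad_y z

# Compare against a finite-difference approximation of d f(g(x)) / dx.
eps = 1e-6
num = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                for e in np.eye(2)])
print(grad_x, num)   # the two should agree to ~1e-6
```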
  31. 6.5.3 Recursively Applying the Chain Rule to Obtain Backprop
      - Setting:
        - Consider a computational graph describing how to compute a single scalar u(n) (e.g. the loss on a training example).
        - We want to obtain the gradient of u(n) with respect to the n_i input nodes u(1), ..., u(n_i).
        - The nodes of the graph are ordered so that we can compute their outputs one after another, starting at u(n_i + 1) and going up to u(n).
        - u(i) = f(A(i)), where A(i) is the set of all parent nodes of u(i).
  32. 6.5.3 Recursively Applying the Chain Rule to Obtain Backprop
      - algorithm for the forward propagation computation (figure omitted)
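A supplementary sketch of forward propagation over a topologically ordered graph; the node representation is an assumption for illustration, not the book's pseudocode.

```python
# Each non-input node stores the indices of its parents and the function f
# that combines their values; nodes are assumed to be in a valid topological order.

def forward_propagation(inputs, nodes):
    """inputs: the input values u(1), ..., u(n_i).
    nodes: list of (parent_indices, f) for u(n_i + 1), ..., u(n)."""
    u = list(inputs)
    for parents, f in nodes:
        u.append(f(*[u[p] for p in parents]))
    return u  # u[-1] is the scalar u(n)

# Example graph (0-based indices): u3 = u1 * u2, u4 = u3 + u1.
nodes = [([0, 1], lambda a, b: a * b),
         ([2, 0], lambda a, b: a + b)]
print(forward_propagation([2.0, 5.0], nodes))  # [2.0, 5.0, 10.0, 12.0]
```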
  33. 6.5.3 Recursively Applying the Chain Rule to Obtain Backprop
      - algorithm for backprop that specifies the actual gradient computation (figure omitted)
  34. 6.5.3 Recursively Applying the Chain Rule to Obtain Backprop
      - algorithm for backprop that specifies the actual gradient computation, cont'd (figure omitted)
  35. 6.5.4 Backprop Computation in a Fully Connected MLP
      - Key point: apply the chain rule to obtain the derivative ∂u(n)/∂u(j): ∂u(n)/∂u(j) = ∑_{i ∈ children of u(j)} (∂u(n)/∂u(i)) (∂u(i)/∂u(j))
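A supplementary sketch of this recursion on the same toy graph as above, accumulating each node's gradient from its children; again an assumed representation, not the book's algorithm.

```python
# grad[j] = sum over children i of grad[i] * d u(i) / d u(j),
# computed from the output node backwards.

def backprop(u, nodes, n_inputs):
    """u: all node values from forward propagation.
    nodes: list of (parent_indices, local_partial) for the non-input nodes,
           where local_partial(values, k) = d f / d (k-th parent) at `values`."""
    grad = [0.0] * len(u)
    grad[-1] = 1.0                       # d u(n) / d u(n) = 1
    for i in reversed(range(len(nodes))):
        parents, local_partial = nodes[i]
        node_idx = n_inputs + i
        vals = [u[p] for p in parents]
        for k, p in enumerate(parents):  # distribute the gradient to each parent
            grad[p] += grad[node_idx] * local_partial(vals, k)
    return grad[:n_inputs]

# Same toy graph: u3 = u1 * u2, u4 = u3 + u1  (0-based indices).
nodes = [([0, 1], lambda v, k: v[1 - k]),     # d(a*b)/da = b, d(a*b)/db = a
         ([2, 0], lambda v, k: 1.0)]          # d(a+b)/da = d(a+b)/db = 1
u = [2.0, 5.0, 10.0, 12.0]                    # values from the forward pass
print(backprop(u, nodes, n_inputs=2))         # [6.0, 2.0] = [du4/du1, du4/du2]
```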
  36. 6.5.4 Backprop Computation in a Fully Connected MLP
      - Consider the computational graph of a fully connected multilayer MLP (figure omitted).
  37. 6.5.4 Backprop Computation in a Fully Connected MLP (figure omitted)
  38. 6.5.5 Symbol-to-Symbol Derivatives
      - symbol-to-number differentiation:
        - takes a computational graph and a set of numerical input values
        - returns the gradient values at those input values
        - used by Torch and Caffe
      - symbol-to-symbol approach:
        - takes a computational graph
        - adds additional nodes to the graph containing symbolic descriptions of the desired derivatives
        - used by Theano and TensorFlow
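A toy illustration of the symbol-to-symbol idea (this is not how Theano or TensorFlow are implemented, and the class names are made up): the derivative is returned as new expression nodes, which could be evaluated later or differentiated again.

```python
class Var:
    def __init__(self, name): self.name = name
    def __repr__(self): return self.name

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b
    def __repr__(self): return f"({self.a} * {self.b})"

class Add:
    def __init__(self, a, b): self.a, self.b = a, b
    def __repr__(self): return f"({self.a} + {self.b})"

def grad(expr, wrt):
    # Returns a new expression (more graph nodes) for d expr / d wrt.
    if isinstance(expr, Var):
        return Var("1") if expr is wrt else Var("0")
    if isinstance(expr, Add):
        return Add(grad(expr.a, wrt), grad(expr.b, wrt))
    if isinstance(expr, Mul):  # product rule
        return Add(Mul(grad(expr.a, wrt), expr.b), Mul(expr.a, grad(expr.b, wrt)))

x, y = Var("x"), Var("y")
z = Add(Mul(x, y), x)   # z = x*y + x
print(grad(z, x))       # (((1 * y) + (x * 0)) + 1)  -- a symbolic graph, not a number
```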
  39. 6.5.5 Symbol-to-Symbol Derivatives (figure omitted)
  40. 6.5.6 General Back-Propagation
      - To compute the gradient of z with respect to its ancestors x:
        - the gradient of z with respect to z is dz/dz = 1; this is the current gradient
        - the gradient of z with respect to its parent: (the current gradient) × (the Jacobian of the operation that produced z)
        - the gradient of z with respect to its grandparent: (the current gradient) × (the Jacobian of the operation that produced the parent)
        - ...
      - If we reach a node through multiple paths, we simply sum the gradients arriving along the different paths.
  41. 6.5.6 General Back-Propagation
      - More formally, assume the subroutines below:
        - get_operation(V): returns the operation that computes V (represented by the edges coming into V)
        - get_consumers(V, g): returns the list of V's children in the graph g
        - get_inputs(V, g): returns the list of V's parents in the graph g
      - Each operation op has the methods below:
        - op.f(inputs): the implementation of the operation itself
        - op.bprop(inputs, X, G): the implementation of the chain rule. X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation. It returns ∑ᵢ (∇_X op.f(inputs)ᵢ) Gᵢ.
      - E.g. suppose an operation op computes C = AB, and G is the gradient of a scalar z with respect to C. Calling op.bprop((A, B), A, G) gives the gradient with respect to A, which is GB⊤.
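A supplementary sketch of op.f and op.bprop for matrix multiplication, following the slide's matmul example; the class itself is hypothetical.

```python
import numpy as np

class MatMul:
    # C = A B
    def f(self, A, B):
        return A @ B

    def bprop(self, inputs, X, G):
        # Chain rule for matrix multiplication: given G = dz/dC,
        # dz/dA = G B^T and dz/dB = A^T G.
        A, B = inputs
        if X is A:
            return G @ B.T
        if X is B:
            return A.T @ G

op = MatMul()
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.5, -1.0], [2.0, 0.0]])
G = np.ones((2, 2))            # pretend z = sum(C), so dz/dC is all ones

print(op.bprop((A, B), A, G))  # equals G @ B.T
print(op.bprop((A, B), B, G))  # equals A.T @ G
```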
  42. 6.5.6 General Back-Propagation (figure omitted)
  43. 6.5.6 General Back-Propagation (figure omitted)
  44. 6.5.7 Example: Back-Propagation for MLP Training ... skipped ...
  45. 6.5.8 Complications
      - Actual implementations of backprop have to be more complex:
        - most implementations need to support operations that can return more than one tensor
        - controlling memory consumption
        - handling various data types (32-bit floating point, 64-bit floating point, integers, ...)
        - tracking undefined gradients
  46. 6.5.9 Differentiation outside the DL Community ... skipped ...
  47. 6.5.10 Higher-Order Derivatives ... skipped ...
  48. 6.6 Historical Notes
      - 17th century: the chain rule
      - 19th century: the gradient descent technique
      - 1940s: machine learning models (e.g. the perceptron) based on linear models. Critics (e.g. Minsky) pointed out the flaws of the linear model family, which led to a backlash against the entire NN approach.
      - 1960s-70s: efficient applications of the chain rule
      - 1980s: applying the chain rule to learning nonlinear functions in NNs. PDP (Rumelhart et al., 1986): popularization of backprop and multilayer NNs, and of "connectionism".
      - early 1990s: a peak of NN research
      - 2006: the renaissance of modern DL
  49. 6.6 Historical Notes
      - Why did NN performance improve between 1986 and 2015? Two main factors:
        - larger datasets
        - larger networks, thanks to more powerful computers and better software
      - A small number of algorithmic changes have also improved NNs:
        - loss function: replacement of MSE with the cross-entropy family
        - hidden units: replacement of sigmoid units with piecewise linear functions (e.g. ReLU)
  50. 6.6 Historical Notes
      - Even after 2006, feedforward networks continued to have a bad reputation: it was widely believed that they would not perform well unless they were assisted by other models.
      - Since 2012, feedforward networks with gradient-based learning have been viewed as a powerful technology. They continue to have unfulfilled potential.
