# Neural Networks - How do they work?

This presentation begins by explaining the basic algorithms of machine learning and, using the same concepts, discusses in detail two supervised deep learning algorithms: artificial neural networks and convolutional neural networks. The relationship between artificial neural networks and basic machine learning algorithms such as logistic regression and softmax is also explored. As a hands-on exercise, the implementation of ANNs and CNNs on the MNIST dataset is also explained.

Published in: Technology

### Neural Networks - How do they work?

1. 1. Classification of Handwritten Digits - Artificial and Convolutional Networks
2. 2. Getting the Code: You can download the code used in the session from https://github.com/Manuyashchaudhary/AnnSession.git For setting up the environment, you can use the docker command - docker pull manuyash/annsession
3. 3. Structure of the Session: 1. Linear, Logistic Regression & Multinomial Logistic Regression - Quick Review 2. Structure of ANN’s 3. Learning in ANN’s 4. Implementation of ANN on MNIST 5. Layers of CNN - Convolutional, RELU, Pooling 6. Parameter Sharing 7. Implementation on MNIST
4. 4. Data: Data is the key to solving any machine learning or artificial intelligence problem. Divide the dataset into training and testing data. The algorithm learns on the training data and is evaluated on the testing data.
5. 5. MNIST: Images are of uniform colour and size. Images are of size 28x28 pixels. Each pixel in the image can take a value from 0 to 255. Properties: Fixed space, constant background.
6. 6. Image as Input: This is a 14X14 representation of pixel intensities. The numbers represent the fraction of pixel covered by the digit.
7. 7. Starting from the basics: Linear Regression To predict values of a variable (Y) which is dependent on multiple other independent variables (X). Fit a line to your training data which generalises the data well. y= mx +c; m is the slope and c is the constant.
8. 8. Fitting a line: For regression, $y = x\beta$. To find the line of best fit, find the $\beta$'s such that the line generalises the data well. Cost/Error = Mean Squared Error. Minimise the error to get the line of best fit.
9. 9. Gradient Descent Intuition:
10. 10. Key Takeaways: You train an algorithm on training data. You are training to find best possible combination of parameters. There is a cost function. The cost function can be minimised by gradient descent. Perform a dot product between parameters and the unseen data and get the output.
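
The takeaways above can be sketched in a few lines of NumPy. This is a minimal illustration, not the session's own code: the toy data (generated from y = 3x + 2), the function name `fit_linear_regression`, the learning rate and the step count are all assumptions chosen for the example. A column of ones is appended so that the second parameter plays the role of the constant c.

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, n_steps=2000):
    """Fit y ~ X @ beta by batch gradient descent on the mean squared error."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_steps):
        residual = X @ beta - y                   # prediction error per sample
        grad = 2.0 * X.T @ residual / n_samples   # gradient of the MSE w.r.t. beta
        beta -= lr * grad                         # step against the gradient
    return beta

# Hypothetical toy data: y = 3x + 2, with a column of ones for the constant term.
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x + 2.0

beta = fit_linear_regression(X, y)

# Prediction on unseen data is a dot product between input and parameters.
prediction = np.array([0.5, 1.0]) @ beta
```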
11. 11. Logistic Regression: 1. It is used for a binary classification. 2. Outputs probability of a class. 3. It is based on a function called sigmoid function.
12. 12. Sigmoid Function: Linear regression: $y(x) = \beta x$. Logistic regression: $\log\frac{p}{1-p} = \beta x$, so $y(x) = p = \frac{1}{1 + e^{-\beta x}}$. Cost function is:
13. 13. Key Takeaways: Perform dot product between parameters and input; apply sigmoid function. There is a cost function. The cost function can be minimised for the parameters using gradient descent. What happens when the variable which is being predicted has more than 2 classes?
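
A minimal sketch of these steps for binary classification follows. The cost function shown on the slide is not available in this transcript, so the binary cross-entropy used here is an assumption (it is the usual choice and appears later in the deck); the toy data, function names and hyperparameters are also illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, y, eps=1e-12):
    """Mean binary cross-entropy; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def fit_logistic_regression(X, y, lr=0.5, n_steps=2000):
    """Gradient descent on the cross-entropy cost (assumed cost function)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = sigmoid(X @ beta)             # dot product, then sigmoid
        grad = X.T @ (p - y) / len(y)     # gradient of the mean cross-entropy
        beta -= lr * grad
    return beta

# Hypothetical toy data: class 1 whenever x > 0.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
X = np.column_stack([x, np.ones_like(x)])   # second column acts as the bias
y = (x > 0).astype(float)

beta = fit_logistic_regression(X, y)
p = sigmoid(X @ beta)                       # predicted probability of class 1
```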
14. 14. Softmax: 1. Multinomial logistic regression. 2. Sigmoid → Softmax 3. For each output/category, we compute a weighted sum of the x’s, add a bias, and then apply softmax. 4. Softmax is defined as $\mathrm{softmax}(x_j) = \frac{e^{x_j}}{\sum_k e^{x_k}}$, where $x_j$ is the weighted sum (plus bias) computed by the $j$th neuron.
15. 15. Working of Softmax:
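
As a rough illustration of how softmax works, the sketch below computes a weighted sum plus bias for three categories and turns the scores into probabilities. The weights `W`, bias `b` and input `x` are made-up values; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Map a vector of scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical parameters for 3 categories and a 2-dimensional input.
W = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [-1.0, 2.0]])
b = np.array([0.0, 0.1, -0.2])
x = np.array([1.0, 2.0])

scores = W @ x + b        # one weighted sum (plus bias) per category
probs = softmax(scores)   # the largest score gets the largest probability
```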
16. 16. Softmax to Neural Networks: Neural networks can be considered as a network of multiple logistic regression computations stacked in parallel and in series to each other. Softmax is applied to the last layer in neural networks. You will understand these points as we go forward.
17. 17. Structure of Artificial Neural Networks: 1. Input, output and hidden layers. 2. The layers are arranged sequentially and each layer is made of multiple neurons. 3. Input layer number of neurons = length of input vector 4. Output layer number of neurons = number of classes in the dependent or target variable.
18. 18. Assumptions, Parameters and Hyperparameters Neurons within a layer do not interact with each other. The layers are densely connected. For every neuron there is a bias and for every interaction there is a weight. The parameters are the weights and biases of the network which are to be found.
19. 19. Working of a Neuron: The input to a neuron is the weighted sum of inputs + bias. Activation function is used to introduce non-linearity in the network. If the output is greater than a threshold, the neuron will fire, otherwise not.
20. 20. Message passing in NNs: Consider two hidden layers $l-1$ and $l$ with $n_1$ and $n_2$ neurons respectively. The total number of interconnections is $n_1 \times n_2$. The output of layer $l-1$ is $a^{l-1} = (a^{l-1}_1, a^{l-1}_2, \ldots, a^{l-1}_{n_1})$, and it is the input to layer $l$. The weighted input of layer $l$ is $w\,a^{l-1}$, where $w$ is an $n_2 \times n_1$ matrix; add the bias to this. The output of layer $l$ is the activation applied to this weighted input.
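
The message passing between two layers can be sketched as a single matrix-vector product. The layer sizes, the random values and the choice of sigmoid as the activation are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, w, b, activation=sigmoid):
    """Output of layer l given the output a_prev of layer l-1."""
    z = w @ a_prev + b       # weighted input to layer l
    return activation(z)     # output of layer l

n1, n2 = 4, 3                           # hypothetical sizes of layers l-1 and l
rng = np.random.default_rng(0)
a_prev = rng.random(n1)                 # output of layer l-1
w = rng.standard_normal((n2, n1))       # one weight per interconnection
b = np.zeros(n2)                        # one bias per neuron in layer l

a = layer_forward(a_prev, w, b)
```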
21. 21. Activation Functions: Activation functions are used to introduce nonlinearity in the network. This nonlinearity makes neural networks more compact, and because of it neural networks can approximate any measurable function. Activation functions should be smooth.
22. 22. Activation Functions: Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$. Tanh: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$. ReLU: $\max(0, a)$.
23. 23. Sigmoid vs Tanh vs ReLU: Sigmoid suffers from the vanishing gradient problem. Tanh has stronger gradients, which reduces the vanishing gradient problem. ReLU reduces the likelihood of vanishing gradients and also introduces sparsity to the network, but it tends to blow up the activations. In practice, tanh usually works the best.
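
The comparison becomes more concrete when the three activations are written next to their derivatives. In this sketch (function names are illustrative) the sigmoid derivative never exceeds 0.25, the tanh derivative reaches 1 at the origin, and the ReLU derivative is exactly 0 or 1, which is one way to see the "stronger gradients" and "sparsity" remarks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(a):
    return np.maximum(0.0, a)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # peaks at 0.25 when x = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2         # peaks at 1.0 when z = 0

def relu_grad(a):
    return np.where(np.asarray(a) > 0, 1.0, 0.0)   # 1 for active units, else 0

inputs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid_slopes = sigmoid_grad(inputs)    # tiny at both ends (saturation)
tanh_slopes = tanh_grad(inputs)
relu_slopes = relu_grad(inputs)
```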
24. 24. Formulating a Cost Function: What is the output of the network? $a^L(w, b, x_i)$. Actual output: $y_i$. Cost incurred on one data point: $C_i(a^L_i, y_i)$. Total cost: the sum of the individual costs over all data points, $\sum_i C_i$. Training problem: $\min_{w,b} \sum_i C_i$. Optimisation problem: find the best combination of $w, b$.
25. 25. Different Cost Functions: Quadratic cost: $C = \frac{1}{n}\sum_i (a^L_i - y_i)^2$. Cross entropy: $C = -\sum_i \left[ y_i \ln a^L_i + (1 - y_i)\ln(1 - a^L_i) \right]$ for a binary target, or $C = -\sum_i \sum_j y_{ij} \ln a^L_{ij}$ summed over classes $j$. Exponential: $C = \tau \exp\left(\frac{1}{\tau}\sum_i (a^L_i - y_i)^2\right)$. Kullback-Leibler divergence: $D_{KL}(P \parallel Q) = D_{KL}(y_i \parallel a^L_i) = \sum_i y_i \ln\frac{y_i}{a^L_i}$.
26. 26. Cost Function: Properties 1. We must be able to write the cost function $C$ as an average over cost functions $C_x$ of individual training examples $x$. a. It allows the gradient of a single training example to be calculated. 2. The cost function should not depend on any activations of the network other than the final output values $a^L$. a. This restriction is what allows us to backpropagate.
27. 27. Minimising the Cost Function: 1. To minimise the cost function, gradient descent is used. a. Why gradient descent and why not calculus? b. What is gradient descent?
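28. 28. Gradient Descent - An Example: Consider a simple function $C(v_1, v_2)$. For small changes $\Delta v_1$ and $\Delta v_2$, the cost function changes as follows: $\Delta C \approx \frac{\partial C}{\partial v_1}\Delta v_1 + \frac{\partial C}{\partial v_2}\Delta v_2 = \nabla C \cdot \Delta v$, where $\Delta C$ is the change in the cost, $\nabla C$ is the gradient and $\Delta v$ is the change in the parameters. If $\Delta v = -\eta \nabla C$, then $\Delta C \approx -\eta \lVert \nabla C \rVert^2 \le 0$, i.e. the cost decreases as long as the learning rate $\eta$ is small enough for the approximation to hold. Update rule: $v \to v' = v - \eta \nabla C$.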
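
The update rule can be tried on a concrete two-variable function. The cost $C(v_1, v_2) = v_1^2 + 2 v_2^2$, the starting point and the learning rate below are arbitrary choices for illustration; with this small learning rate the recorded cost never increases from one step to the next.

```python
def cost(v1, v2):
    """Hypothetical cost function with its minimum at (0, 0)."""
    return v1 ** 2 + 2.0 * v2 ** 2

def gradient(v1, v2):
    """Partial derivatives of the cost w.r.t. v1 and v2."""
    return (2.0 * v1, 4.0 * v2)

def gradient_descent(v, eta=0.1, n_steps=100):
    """Repeatedly apply v -> v - eta * grad C and record the cost."""
    history = [cost(*v)]
    for _ in range(n_steps):
        g = gradient(*v)
        v = (v[0] - eta * g[0], v[1] - eta * g[1])
        history.append(cost(*v))
    return v, history

v_final, history = gradient_descent((3.0, -2.0))
```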
29. 29. Stochastic Gradient Descent: Recall cost function assumption 1: the cost function can be written as an average of the cost over individual training examples. To compute the gradient $\nabla C$, you need to compute the gradient $\nabla C_x$ of each training input separately and then average them: $\nabla C = \frac{1}{n}\sum_x \nabla C_x$. SGD: calculate the gradient of a small mini-batch of, say, $m$ inputs and use that as an estimator of the true gradient. Carry out updates using the gradient of the mini-batch. Carry out a mini-batch update for another randomly chosen batch, and so on until the training inputs are exhausted. This completes one epoch of learning. Repeat for the specified number of epochs. New hyperparameters: size of mini-batch, number of epochs, learning rate.
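
The epoch and mini-batch loop can be sketched as follows. To keep the example short it is applied to a linear model with a mean squared error cost rather than to a neural network; the function names, the toy data and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def mse_gradient(X_batch, y_batch, beta):
    """Gradient of the mean squared error averaged over one mini-batch."""
    return 2.0 * X_batch.T @ (X_batch @ beta - y_batch) / len(y_batch)

def minibatch_sgd(X, y, grad_fn, params, lr=0.1, batch_size=10, n_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(n)                  # reshuffle once per epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # The mini-batch gradient is used as an estimate of the true gradient.
            params = params - lr * grad_fn(X[batch], y[batch], params)
    return params

# Hypothetical noise-free data from y = 3x + 2.
rng = np.random.default_rng(1)
x = rng.random(100)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x + 2.0

beta = minibatch_sgd(X, y, mse_gradient, np.zeros(2))
```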
30. 30. But how do we calculate these gradients? Backpropagation: Let’s denote the input to any layer $l$ as $z^l$: $z^l = w^l a^{l-1} + b^l$. Output of layer $l$: $a^l = f(w^l a^{l-1} + b^l) = f(z^l)$. Let’s consider the output of the output layer $L$ for an individual input $x_i$: $a^L(w, b, x_i) = \sigma(z^L(w, b, x_i))$. So the cost is $C_i = (a^L - y_i)^2$. Now we want to differentiate this cost to calculate the gradient of $C_i$. Instead of differentiating w.r.t. $a$, we will differentiate w.r.t. $z$.
31. 31. Let’s do some math: Chain rule: $\frac{\partial (a^L(w,b,x_i) - y_i)^2}{\partial z^L} = 2\,(a^L(w,b,x_i) - y_i)\,\frac{\partial \sigma(z^L(w,b,x_i))}{\partial z^L}$. Sigmoid function: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
32. 32. Gradient of $C_i$ w.r.t. $z^L$ (output layer): $\frac{d (a^L(w,b,x_i) - y_i)^2}{d z^L} = 2\,(a^L(w,b,x_i) - y_i)\,\sigma(z^L(w,b,x_i))\,(1 - \sigma(z^L(w,b,x_i)))$, where $z^L(w,b,x_i) = w^L a^{L-1}(w^{1:L-1}, b^{1:L-1}, x_i) + b^L$. The aim is to calculate $\frac{d (a^L(w,b,x_i) - y_i)^2}{d w^L}$ and $\frac{d (a^L(w,b,x_i) - y_i)^2}{d b^L}$: $\frac{dC_i}{dw^L} = \frac{dC_i}{dz^L}\frac{dz^L}{dw^L}$ and $\frac{dC_i}{db^L} = \frac{dC_i}{dz^L}\frac{dz^L}{db^L}$.
33. 33. Gradients of the Last Layer: $\frac{dC_i}{dw^L} = \frac{dC_i}{dz^L}\frac{dz^L}{dw^L}$ and $\frac{dC_i}{db^L} = \frac{dC_i}{dz^L}\frac{dz^L}{db^L}$. But we need the gradients of the entire network to update its weights and biases. How do the gradients of the last layer help? Backpropagation: propagating the gradients backwards through the network, starting from the last layer. The gradients of any layer can be written in terms of the gradients of the next layer. Therefore, knowing the gradients of layer $L$, you can write the gradients of layer $L-1$ in terms of the gradients of layer $L$ (which are known to you), the gradients of layer $L-2$ in terms of the gradients of layer $L-1$, and so on.
34. 34. Gradient of $C_i$ w.r.t. $z^l$: For the last layer, $\frac{dC_i}{dw^L} = \frac{dC_i}{dz^L}\frac{dz^L}{dw^L}$ and $\frac{dC_i}{db^L} = \frac{dC_i}{dz^L}\frac{dz^L}{db^L}$. For any layer $l$, $z^l = w^l a^{l-1} + b^l$, so $\frac{dC_i}{dw^l} = \frac{dC_i}{dz^l}\frac{dz^l}{dw^l}$ and $\frac{dC_i}{db^l} = \frac{dC_i}{dz^l}\frac{dz^l}{db^l}$. What remains to be found is $\frac{dC_i}{dz^l}$, the gradient of $C_i$ w.r.t. the cumulative input of layer $l$.
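35. 35. Sorry!! A little more math: $\frac{dC_i}{dz^l} = \frac{dC_i}{da^l}\frac{da^l}{dz^l} = \sigma(z^l)(1 - \sigma(z^l))\,\frac{dC_i}{da^l} = \sigma(z^l)(1 - \sigma(z^l))\,\frac{dC_i}{dz^{l+1}}\frac{dz^{l+1}}{da^l}$, where $z^{l+1} = w^{l+1} a^l + b^{l+1}$.
36. 36. $\frac{dC_i}{dz^l} = \sigma(z^l)(1 - \sigma(z^l))\, w^{l+1}\,\frac{dC_i}{dz^{l+1}}$ and $\frac{dC_i}{dw^l} = \frac{dC_i}{dz^l}\frac{dz^l}{dw^l}$.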
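
The recursion derived on the preceding slides can be sketched for a small fully connected sigmoid network with the quadratic cost $C_i = \sum (a^L - y_i)^2$. This is an illustrative implementation, not the session's code: the layer sizes and random parameters are made up, and in matrix form the factor $w^{l+1}$ appears as a transpose. Comparing one analytic gradient entry against a finite-difference estimate of the cost is a useful sanity check on such code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Return the activations of every layer; activations[0] is the input."""
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w @ activations[-1] + b))
    return activations

def cost(weights, biases, x, y):
    return np.sum((forward(weights, biases, x)[-1] - y) ** 2)

def backprop(weights, biases, x, y):
    """Gradients of the quadratic cost w.r.t. every weight matrix and bias."""
    activations = forward(weights, biases, x)
    grads_w = [None] * len(weights)
    grads_b = [None] * len(weights)
    a_out = activations[-1]
    # Output layer: dC/dz^L = 2 (a^L - y) * sigma(z^L) (1 - sigma(z^L))
    delta = 2.0 * (a_out - y) * a_out * (1.0 - a_out)
    for l in reversed(range(len(weights))):
        grads_w[l] = np.outer(delta, activations[l])   # dC/dz times dz/dw
        grads_b[l] = delta                             # dz/db = 1
        if l > 0:
            a = activations[l]
            # dC/dz of the previous layer from dC/dz of this layer
            delta = (weights[l].T @ delta) * a * (1.0 - a)
    return grads_w, grads_b

# Hypothetical 4-3-2 network with random parameters.
rng = np.random.default_rng(0)
sizes = [4, 3, 2]
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n) for n in sizes[1:]]
x = rng.random(4)
y = np.array([1.0, 0.0])

grads_w, grads_b = backprop(weights, biases, x, y)
```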
37. 37. ANN on MNIST:
38. 38. Convolutional Neural Networks
39. 39. How are CNNs different from ANNs: ConvNet architectures make the explicit assumption that the inputs are images. Their architecture is different from feedforward neural networks to make them more efficient by reducing the number of parameters to be learnt. In an ANN, if you have a 150x150x3 image, each neuron in the first hidden layer will have 67500 weights to learn. ConvNets take a 3-D input volume, and the neurons in a layer are only connected to a small region of the layer before it.
40. 40. ConvNets: The neurons in the layers of ConvNet are arranged in 3 dimensions: height, width, depth. Depth here is not the depth of the entire network. It refers to the third dimension of the layers and hence a third dimension of the activation volumes. In essence, a ConvNet is made of layers which have a simple API - transform a 3-D input volume to a 3-D output volume with some differentiable function which may or may not have parameters.
41. 41. Layers of a ConvNet: Input Layer - 28 x 28 x 1 for MNIST (grayscale) Convolution Layer - The neurons are connected to small/local regions in the input. ReLU - This is the activation layer in the network. Pool - This layer will downsample on the width and the height but not on depth. It applies a fixed function such as max() or mean() etc. Fully Connected Layer - Just like the last layer in feedforward networks, this layer too will give us the class scores arranged across the depth dimension. Convolutional and fully connected layers have parameters, relu and pooling layers do not. Convolutional, fully connected and pooling layers have additional hyperparameters too.
42. 42. Architecture:
43. 43. Architecture:
44. 44. Convolutional (Conv) Layer In machine learning, this flashlight is known as a filter and the region it shines over is known as the receptive field/size of the filter. Filter: Is an array of numbers also known as the weights/parameters (learnable). A very important dimension to note is the depth of the filter. The depth must be equal to the depth of the input volume. This filter will now slide/convolve over the rest of the image performing element wise multiplication, summing it up and returning a single number. After convolving over the entire image, you will get an activation map which is 2-D. For a 32x32x3 dimension input, using a
45. 45. Filters: You can increase the number of filters on the input volume to increase the number of activation maps you get. Each filter gives you an activation map. Each activation map tries to learn a different aspect of the image such as an edge, a blotch of colour etc. If on a 32x32x3 image volume, you implement 12 filters of size 5x5x3, then the first convolutional layer will have dimension 28x28x12 under certain conditions. Basically, the more the filters, the better the spatial dimensions are preserved. Now, let’s talk about the certain conditions.
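
The sliding-filter operation can be written as a deliberately naive loop, shown here as a sketch rather than an efficient implementation. It assumes stride 1 and no zero-padding, and uses random values for the image and the 12 filters; under those assumptions a 32x32x3 input and 5x5x3 filters give the 28x28x12 volume mentioned above.

```python
import numpy as np

def convolve_single_filter(volume, kernel, bias=0.0, stride=1):
    """Slide one filter over an (H, W, D) volume and return a 2-D activation map."""
    height, width, depth = volume.shape
    f_h, f_w, f_d = kernel.shape
    assert f_d == depth, "filter depth must equal input depth"
    out_h = (height - f_h) // stride + 1
    out_w = (width - f_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = volume[r:r + f_h, c:c + f_w, :]
            out[i, j] = np.sum(patch * kernel) + bias   # multiply element-wise, then sum
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                # hypothetical input volume
filters = rng.standard_normal((12, 5, 5, 3))   # 12 filters of size 5x5x3

# One activation map per filter, stacked along the depth dimension.
activation_volume = np.stack(
    [convolve_single_filter(image, k) for k in filters], axis=-1
)
```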
46. 46. Filter - Hyperparameters: The size of the filter is a hyperparameter. When you apply a filter to an input volume, the size of the output volume depends on 3 hyperparameters: fibre/depth, stride and zero-padding. Fibre/Depth - This refers to the number of filters applied to the input volume, each learning to recognise something different in the input. Stride - It refers to the pace at which the filter moves through the input volume. If stride is 1, we move the filters one pixel at a time. Zero-padding - To control the size of the output volume, the input volume can be padded with zeroes around the border. Given these hyperparameters, the spatial size of the output volume is given as $((W - F + 2P)/S) + 1$, where W is the size of the input, F is the filter size, P is the amount of zero-padding and S is the stride.
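
The output-size formula is easy to check with a small helper. The function name and the decision to raise an error when the filter does not tile the input evenly are choices made for this sketch.

```python
def conv_output_size(w, f, p=0, s=1):
    """Spatial output size ((W - F + 2P) / S) + 1 for input size w, filter size f,
    zero-padding p and stride s."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("filter does not fit the padded input with this stride")
    return span // s + 1

size_no_padding = conv_output_size(32, 5)         # 32x32 input, 5x5 filter, stride 1
size_with_padding = conv_output_size(32, 5, p=2)  # padding of 2 preserves the size
```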
47. 47. Number of Parameters: Consider the output volume of 28x28x12 of the first convolutional layer, which was achieved by applying a filter of 5x5x3 on the input of 32x32x3. Number of neurons in the layer = 28*28*12 = 9408 Each neuron has 5*5*3 = 75 weights and 1 bias i.e. 76 parameters. Overall number of parameters of the first layer = 9408*76 = 715,008
48. 48. Parameter Sharing: Simple Assumption: For each activation map or depth slice, constrain the neurons to use the same weights and bias. Therefore, for the last example, the conv layer will have 12 unique sets of weights (one 5x5x3 filter per activation map) and 12 biases. Overall number of weights in the first layer: 12*5*5*3 = 900 Total parameters = 900 + 12 = 912
49. 49. Conv Layer Summary: Accepts an input volume of $W_1 \times H_1 \times D_1$. Needs 4 hyperparameters: number of filters $K$, filter/receptive field size $F$, stride $S$, zero-padding $P$. The output is of size $W_2 \times H_2 \times D_2$, where $W_2 = ((W_1 - F + 2P)/S) + 1$, $H_2 = ((H_1 - F + 2P)/S) + 1$ and $D_2 = K$.
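
The two parameter counts can be reproduced with a few lines of arithmetic. The helper names are illustrative; the numbers are the ones from the 32x32x3 input with 12 filters of size 5x5x3.

```python
def conv_params_without_sharing(out_w, out_h, n_filters, f, depth):
    """Every neuron in the output volume has its own f*f*depth weights and a bias."""
    n_neurons = out_w * out_h * n_filters
    return n_neurons * (f * f * depth + 1)

def conv_params_with_sharing(n_filters, f, depth):
    """All neurons in one activation map share a single filter and a single bias."""
    return n_filters * f * f * depth + n_filters

unshared = conv_params_without_sharing(28, 28, 12, 5, 3)
shared = conv_params_with_sharing(12, 5, 3)
```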
50. 50. ReLU (Rectified Linear Units) Layer: Just like in feedforward neural networks, the purpose of an activation layer in a ConvNet is to introduce nonlinearity. You can also use activations like tanh or sigmoid, but ReLU works better in practice. Why? It is cheap to compute, thus enabling the network to learn faster. Also, it helps us reduce the problem of vanishing gradients.
51. 51. Pooling Layers: Pooling layer is also known as the downsampling layer. Use - Progressively reduce the spatial size of its input, thus reducing the number of parameters in the network and controlling overfitting. The pooling layer works on each depth slice independently, resizes it using the mathematical operation specified such as MAX or Avg. etc. Most common form of pooling is to apply a 2x2 filter with a stride of 2 on the input volume. The depth dimension will remain unchanged.
52. 52. Pooling Layer: A 2x2 filter with a stride of 2 applying a MAX function. As you can see, the number of activations is reduced by 75%.
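
A minimal sketch of 2x2 max pooling with stride 2 is given below, using a made-up 4x4x1 input. Each depth slice is pooled independently, so the depth is unchanged while only a quarter of the values survive.

```python
import numpy as np

def max_pool(volume, size=2, stride=2):
    """Max-pool an (H, W, D) volume over its height and width, keeping the depth."""
    height, width, depth = volume.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    out = np.zeros((out_h, out_w, depth))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = volume[r:r + size, c:c + size, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max per depth slice
    return out

volume = np.arange(16, dtype=float).reshape(4, 4, 1)   # hypothetical input
pooled = max_pool(volume)
```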
53. 53. Fully Connected Layer, Dropout and Normalisation: Fully connected layers in a ConvNet are exactly the same as the layers in feedforward neural networks. The last layer in a ConvNet is a fully connected softmax layer. Dropout Layers: These layers have a very specific function in a ConvNet, which is to avoid overfitting. A random fraction of activations is ‘dropped out’, or set to 0, during the forward pass by this layer. This makes sure that the network is not simply memorising the training data. This layer is only used during training time. Normalisation - These layers are usually added after pooling layers to normalise the output of the pooling layer.
54. 54. Transfer Learning: Transfer learning is the process of taking a pre-trained model whose weights and parameters have been trained on a large dataset, and fine-tuning it according to your own data. You remove the last layer of the network and replace it with your own classifier, and keep the weights and biases of the rest of the network constant. The idea is that the pre-trained model will act as a feature extractor.
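
The dropout step can be sketched as a random mask applied during the forward pass. The rescaling of the surviving activations by 1 / (1 - rate), often called inverted dropout, is an assumption added here so that the layer can be skipped unchanged at test time; the slides do not say which variant the session uses.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of activations during training only."""
    if not training or rate == 0.0:
        return activations                            # no-op at test time
    keep_mask = rng.random(activations.shape) >= rate
    # Inverted dropout (assumed variant): rescale survivors to keep the expected sum.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10000)                          # hypothetical layer output
dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, rng=rng, training=False)
```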
55. 55. Convnet on MNIST: