Explanation of how Convolutional Neural Networks work, given to the AI Guild at Wealth Wizards.
Talk - https://www.youtube.com/watch?v=IatCeksBJFU&t=1s
11. Learned Features of an Elephant
Source: https://www.slideshare.net/akshaymuroor/deep-learning-24650492
Elephant input images Object Parts Features / Edges
Editor's Notes
Facebook image tagging,
Google photo search,
Amazon product recommendations
Super hard for computers to do what we take for granted, because an image for them is just an array of pixel values – numbers describing how red, green or blue each pixel is, or, if it’s in B&W, how light that pixel is – values from 0 to 255.
Here are some elephants. We know that these are all pictures of elephants.
But how do we know?
It’s actually quite hard… they’re all shown at different angles, in different poses, different colours and in different parts of the image. It would be very difficult to program a computer to recognise all of these images manually. So that’s where convolutional neural networks come in… they enable the computer to learn how to recognise an elephant in the same way we do.
So to better understand how, we need to think about how we recognise an image of something.
The Convolutional part of the network is basically about learning better features for the normal NN to then classify.
(After Conv & Pooling layers, there’s just a normal classifier neural net).
Without convolution, a neural network wouldn’t know that these pixels (numbers to a computer) are related to one another and they make a tusk, and these ones make a trunk. They’d be treated independently.
So if you have a 2,000 x 1,000 pixel image, that would be 2 million “independent” pixels and therefore 2 million features. The neural network would need 2 million input nodes – hugely computationally expensive.
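The arithmetic above can be sketched in a couple of lines (the image size is the one from the notes; a colour image would triple the count):

```python
# Treating every pixel as an independent feature explodes quickly.
# A 2,000 x 1,000 greyscale image has 2 million input values, so a
# fully connected input layer needs one node (and weights) per pixel.
width, height = 2000, 1000
num_features = width * height
print(num_features)  # 2000000
```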
To understand convolution:
So, imagine you have a picture of a mouse.
You want to apply a filter in your convolutional layer that asserts how “curved line” the image is. [arbitrary choice of curved line]
The filter represents this curved line like the top right matrix.
So when you apply this filter to the top left section of the mouse image, you end up with a really high number, i.e. that section of the image is very similar to that curved line filter.
But when you apply it to the top right section of the mouse image, you get 0, so it’s saying it’s not similar to that curved filter at all.
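A rough sketch of that curved-line filter idea, with invented values purely for illustration (the 7 x 7 filter shape and the pixel values are assumptions, not the actual mouse example):

```python
import numpy as np

# Hypothetical 7x7 "curved line" filter: weight 30 along the curve, 0 elsewhere.
curve_filter = np.array([
    [0, 0, 0, 0, 0, 30, 0],
    [0, 0, 0, 0, 30, 0, 0],
    [0, 0, 0, 30, 0, 0, 0],
    [0, 0, 0, 30, 0, 0, 0],
    [0, 0, 0, 30, 0, 0, 0],
    [0, 0, 0, 30, 0, 0, 0],
    [0, 0, 0, 30, 0, 0, 0],
])

# A patch of the image containing the same curve responds strongly...
matching_patch = (curve_filter > 0).astype(int) * 50  # pixel value 50 along the curve
# ...while a patch with no curve in it gives no response at all.
blank_patch = np.zeros((7, 7), dtype=int)

print(np.sum(curve_filter * matching_patch))  # 10500 -- "very curved line"
print(np.sum(curve_filter * blank_patch))     # 0 -- "not similar at all"
```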
During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform.
Imagine a flashlight that is shining over the top left of the image. Let’s say that the light this flashlight shines covers a 5 x 5 area. And now, let’s imagine this flashlight sliding across all the areas of the input image. The flashlight is known in ML as a FILTER, and the area it shines on is called the RECEPTIVE FIELD.
The filter itself is made up of WEIGHTS.
[The filter has the same depth as the image, so if the image is 5 x 5 x 3 (3 deep because there’s one matrix per colour channel – R, G and B), then a 3 x 3 filter is actually a 3 x 3 x 3 tensor].
The filter then slides across the image – it CONVOLVES – multiplying the values in the filter with the original pixel values, element-wise. These multiplications are then summed.
So one number is output per location of the filter on the image.
So for a 5 x 5 input image, with a 3 x 3 filter, you get a 3 x 3 matrix of numbers – a FEATURE MAP.
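The slide-multiply-sum step above can be written out directly in numpy; this is a minimal sketch with a toy image and an all-ones filter (CNN “convolution” is technically cross-correlation, which is what this implements):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each location, multiply
    element-wise and sum, giving one number per location."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)  # toy 5x5 "image"
kernel = np.ones((3, 3))             # toy 3x3 filter
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3) -- the FEATURE MAP
```

A 3 x 3 filter fits into a 5 x 5 image in 3 x 3 = 9 positions, which is exactly why the feature map is 3 x 3.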
The number of units the filter moves across on each step is called the STRIDE.
You can use several filters, which increases the depth of the output feature map (e.g. for a 32 x 32 input with 5 x 5 filters, 2 filters would give a 28 x 28 x 2 tensor).
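The output-size arithmetic behind numbers like 28 x 28 x 2 is simple; here’s a sketch assuming a 32 x 32 input, 5 x 5 filters and stride 1 (those sizes are an assumption – the notes don’t state the input size):

```python
def output_size(input_size, filter_size, stride=1):
    # Number of positions the filter can take along one dimension.
    return (input_size - filter_size) // stride + 1

# A 32x32 image with a 5x5 filter, stride 1 -> 28 positions per dimension;
# stacking the feature maps from 2 filters gives a 28 x 28 x 2 tensor.
n = output_size(32, 5)
print((n, n, 2))  # (28, 28, 2)
```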
Good to use different filters because each one can detect a different thing – smoothness, vertical lines, horizontal lines or whatever. - they’re like FEATURE IDENTIFIERS.
Then a POOLING (or “DOWNSAMPLING”) layer can be used – not actually required – but this reduces the number of features in your neural network to just the most important ones, so reducing the computational cost.
Reducing the no. features can help prevent “overfitting” in your neural network [which is where your model fits the training data too closely and then doesn’t generalise well to new examples].
Main method used in a pooling layer is MAXPOOLING – example above → convolve a window over the output from the convolutional layer and keep just the maximum value in each section.
Lose some resolution, but way more efficient.
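A minimal max-pooling sketch (2 x 2 window, stride 2 – the usual choice, though the notes don’t specify the window size):

```python
import numpy as np

def maxpool2x2(feature_map):
    """2x2 max pooling, stride 2: keep only the largest value in each
    2x2 block, halving each spatial dimension of the feature map."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 1],
    [0, 1, 3, 4],
])
print(maxpool2x2(fm))
# [[6 5]
#  [7 9]]
```

Each 2 x 2 block collapses to its single biggest value – that’s the lost resolution, and also the saved computation.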
3 main types of layer: convolutional and pooling, with a fully connected layer at the end.
Outputs a probability of the classes best describing that image.
These are like individual layers of a neural network.
You take the image, pass it through a series of convolutional, pooling (downsampling), and fully connected layers, and get an output.
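The whole pipeline can be followed just by tracking shapes; this is a hypothetical walk-through for a 32 x 32 RGB input (all layer sizes here are illustrative assumptions, not from the talk):

```python
# conv -> pool -> fully connected, tracked as shapes only.
shape = (32, 32, 3)                               # input RGB image
shape = (32 - 5 + 1, 32 - 5 + 1, 8)               # conv: eight 5x5x3 filters -> (28, 28, 8)
shape = (shape[0] // 2, shape[1] // 2, shape[2])  # 2x2 max pool -> (14, 14, 8)
flat = shape[0] * shape[1] * shape[2]             # flatten for the fully connected layer
print(flat)  # 1568 inputs to the final classifier
```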
Brain-like structure for supervised learning: axons and dendrites
So the NN then takes the features (X1… Xn) and, by applying a series of weights, converts these into a probability that the image lies in each of our output classes.
These weights get refined while training the neural network on the training set. It will compute a cost, which is effectively the difference between what we’ve predicted and what the actual result in the training set is. And it will try to minimise this cost by back-propagating the error through its network and adjusting the weights until the cost has converged on a small value.
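The “adjust the weights to minimise the cost” idea can be sketched with a single weight and plain gradient descent (real networks do this for millions of weights at once via back-propagation; the numbers here are invented):

```python
# One weight w, one training example (x, y), squared-error cost.
def cost(w, x, y):
    return (w * x - y) ** 2

def gradient(w, x, y):
    return 2 * x * (w * x - y)  # d(cost)/dw

x, y = 2.0, 6.0       # we want w * 2 == 6, so w should converge to 3
w, lr = 0.0, 0.05     # start from 0 with a small learning rate
for _ in range(200):
    w -= lr * gradient(w, x, y)  # step downhill on the cost
print(round(w, 3))  # 3.0 -- the cost has converged on a small value
```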
You get out a probability of the image being in each of your output classes.
So to finish, here are some pictures to show some actual features that a CNN has learned from input data images of Elephants.
So to a computer these are the features that make elephants look like elephants and not cows, cars or books.