
An Introductory Guide to Deep Neural Networks

Algorithms built with deep learning methods have featured extensively in recent headlines. AlphaGo, developed by the Google DeepMind team, was the first computer program to defeat a top professional Go player, beating Lee Sedol in a five-game match. Go has a far larger branching factor than chess, so traditional AI methods were unable to outperform human players. AlphaGo’s superhuman skill was initially bootstrapped from human play: the algorithm was taught to mimic expert moves taken from recorded games. Once AlphaGo reached a certain level of skill, it played games against other instances of itself to further increase its abilities through reinforcement learning.


On October 18th 2017, the DeepMind team released the paper ‘Mastering the Game of Go without Human Knowledge’, detailing a new algorithm – AlphaGo Zero. This new variant was able to beat the original AlphaGo 100 games to zero. AlphaGo Zero reached this superhuman proficiency with no human data or supervised learning, beyond being taught the rules of Go. The algorithm was its own teacher: every game played against itself resulted in higher-quality move selection and stronger self-play in the next iteration.

Reddit recently shut down the subreddit r/deepfakes, where users were taking celebrities’ faces and mapping them onto actresses’ bodies in pornographic films. The results were so convincing that many sites followed Reddit’s example and banned deepfake videos, on the grounds that they do not tolerate non-consensual content. The deepfake app uses deep learning and neural networks to ‘learn’ the features of both the original and the desired face.

The current applications of deep learning are both fascinating and concerning. To better understand how deep learning works, I’m going to walk through how it can be applied to teach a computer to recognise images. Vision seems easy for humans – few people would have trouble distinguishing between a dog and a car, for example.


But to train a computer to do exactly the same thing means fully understanding how our brain makes this distinction. Once we understand that, true artificial intelligence may be possible. Until then, we use mathematical representations of what’s going on in the brain – in this case, neural networks.

Neural networks simulate what goes on in our brains by taking in an input, performing some processing and then producing an output. The simplest neural network is called the perceptron, which we implemented in Python in the previous post. We used the sonar dataset from the UCI Machine Learning Repository to train our perceptron to recognise different types of objects based on the strength of the response from 60 angles of a sonar chirp, when reflecting off different objects.

When we trained the perceptron, we were fortunate enough to have our input variables in a form that the computer could easily understand. But what should a computer ‘see’ when you show it images of dogs and cars? A person’s visual input comes through their eyes. In the 1960s, Hubel and Wiesel conducted experiments on sedated spider monkeys showing that individual neurons respond to small regions of the animal’s visual field. A single neuron does not process the entire visual field at once; instead, each cell activates for a smaller section known as its receptive field. The combination of many firing neurons, each with its own receptive field, forms a complete map of the visual space.

This is very much a simplified explanation of the biological processes of animal vision, but the same ideas apply for building a convolutional neural net. Instead of taking all inputs at once like we did with our perceptron, we will split the image into a series of receptive fields that combine to form a representation that allows the neural net to classify the image.


Splitting the image into these fields is done by convolution, which involves sliding a ‘tile’ over the image to pick out its most important parts. The best way to visualise a convolution is to imagine shining a torch over the image. In the example below, we have a 6 x 6 pixel image, and a 4 x 4 pixel square that ‘shines’ over the top left of the picture. The 4 x 4 square is called the filter, or the kernel, and the section of the picture that it lies over is its receptive field.

The filter contains an array of numbers called weights. As the filter slides across the image, each pixel covered by the filter is multiplied by the corresponding weight, and the products are summed to produce a single number. That number is recorded in an activation map, which forms the first ‘hidden layer’ of the network. The filter then shifts across to the next set of pixels and the process repeats, until eventually the full activation map has been calculated.
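
To make the arithmetic concrete, here is a minimal sketch of the sliding-window computation in Python with NumPy (the image and filter values are arbitrary placeholders, not weights from a real network):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise
    products at each position, producing an activation map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h, out_w = ih - kh + 1, iw - kw + 1
    activation = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + kh, x:x + kw]  # the receptive field
            activation[y, x] = np.sum(patch * kernel)
    return activation

# A 6 x 6 'image' and a 4 x 4 filter, matching the example above
image = np.arange(36).reshape(6, 6).astype(float)
kernel = np.ones((4, 4))
print(convolve2d(image, kernel).shape)  # a 3 x 3 activation map
```

Each entry of the returned array corresponds to one position of the filter over the image, exactly as described above.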

Feature identification

So what’s the motivation for running a filter over the image? When we look at a picture of a dog, we can tell it’s a dog because we see the features of a dog, like four legs and a tail. The filters work in a similar way, acting as feature identifiers. Here is a very simple example of a 10 x 10 pixel filter that can detect a curve:

An artificial example of a filter that will activate when run over a curve

As we run this filter over an image with a curve in it, the activation map will produce a much larger number for the regions of the image with a curve than those without.

The dog’s ear has a curve very similar to our filter, so the number in the corresponding part of the activation map will be relatively large

Running the curve detector over an image with a different shape produces a smaller number, indicating less of an ‘activation’ for the area without a curve.

When the filter passes over an area without the curve shape, the activation value is lower
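
As a rough illustration of why the activation values differ, consider a toy 3 x 3 ‘curve detector’ (the values here are invented for this example; real learned filters look different):

```python
import numpy as np

# A toy 'curve detector': large weights along a diagonal arc,
# zeros elsewhere (the 10 x 10 filter in the figure works the same way)
curve_filter = np.zeros((3, 3))
curve_filter[0, 2] = curve_filter[1, 1] = curve_filter[2, 0] = 30

# A patch containing a matching curve, and one with a straight edge
curve_patch = np.zeros((3, 3))
curve_patch[0, 2] = curve_patch[1, 1] = curve_patch[2, 0] = 50

edge_patch = np.zeros((3, 3))
edge_patch[:, 0] = 50  # vertical line: little overlap with the curve

print(np.sum(curve_filter * curve_patch))  # large activation: 4500.0
print(np.sum(curve_filter * edge_patch))   # small activation: 1500.0
```

The matching patch produces a much larger sum than the straight edge, which is exactly the difference the activation map records.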

The filters learned for real images are more complex than simple curve detectors, but the concept is the same. A 2013 paper by Zeiler and Fergus visualised the activation maps of trained convolutional neural nets to better understand how they ‘saw’ the images they were classifying.

We can see in this example how the filters activate for certain parts of an image – an eye, the shape of a dog’s face, the legs of a bird

The first layer of convolution detects low-level features of the image, like straight lines and curves. Deep convolutional neural nets have additional layers that can detect higher-order features – this is what people are referring to when they talk about deep learning. Multi-layered convolutional neural nets apply convolutions to layers that have already had convolutions applied. Below we have an example of a convolutional neural network architecture with multiple convolution layers. The higher-level convolutions are applied to the activation maps produced by the previous layers and identify more complex features. Visualising what the model is detecting becomes harder at this point, because the image has had its dimensions reduced. The first layer may only make out straight lines and curves, but the layers below can combine them into more complex features like legs and tails.

A real neural network can have multiple convolution layers to pick out higher level features

You may notice some additional steps like pooling and ReLU – these are important to the performance of our CNN, but we will come to what they are actually doing when we implement our model.

Fully connected layer

Once the convolution layers have detected both the high- and low-level features of the image, the last layer of the neural net needs to predict what the image actually is. This last layer is called the fully connected layer; it looks at the outputs of the previous layer and produces an n-dimensional vector, where n is the number of classes the model is attempting to predict. In the case of recognising handwritten digits, the output vector would have 10 entries (n = 10), one per digit. The vector shows the probability of the input belonging to each class.

    \[prediction = \begin{bmatrix}0 &  0.25 & 0.05 & 0 & 0 & 0 & 0 & 0.7 & 0 & 0\end{bmatrix}\]

The example above indicates a 25% probability that the digit is a 1, a 5% probability that it is a 2, and a 70% probability that it is a 7. These probabilities are based on the activation maps from the previous layer.
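
A fully connected layer typically produces these probabilities by applying a softmax function to its raw class scores. A small sketch (the scores here are made up for illustration):

```python
import numpy as np

def softmax(logits):
    """Turn raw scores from the fully connected layer into
    probabilities that sum to one."""
    shifted = logits - np.max(logits)  # shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw scores for the 10 digit classes (0-9)
logits = np.array([0.1, 2.0, 1.0, 0.1, 0.1, 0.1, 0.1, 3.0, 0.1, 0.1])
probs = softmax(logits)
print(probs.round(2))
print(probs.argmax())  # 7 - the most likely digit
```

Softmax preserves the ordering of the scores while squashing them into a valid probability distribution, which is why the largest raw score becomes the predicted class.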

Training the model with backpropagation

So how does a convolutional neural net know which features to look for in an image? Where do the weights on the kernel matrix come from? Backpropagation is the process by which a convolutional neural net ‘learns’ to classify inputs. The training process has four steps:

  1. Forward pass
  2. Loss function
  3. Backward pass
  4. Weight update

In the first training iteration, the weights inside the kernel matrix are randomly initialised. A training image (e.g. a handwritten 8) is passed through the network for the forward pass, and a prediction vector is generated. Because the network hasn’t been trained to recognise any features yet, the predictions will look fairly random.

    \[prediction_{initial} = \begin{bmatrix}0 &  0.25 & 0.05 & 0 & 0 & 0 & 0 & 0.7 & 0 & 0\end{bmatrix}\]

The loss function is what guides the model towards a correct prediction. For a handwritten 8, a perfect prediction looks like:

    \[prediction_{target8} = \begin{bmatrix}0 &  0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\end{bmatrix}\]

We want to get our model as close as possible to this vector every time it’s presented with a handwritten 8. To do this, we need to mathematically define the difference, or ‘loss’, between our prediction and the perfect prediction. A common choice is a squared-error loss function:

    \[L =  \sum \frac{1}{2}(target - output)^2\]
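
As a quick sanity check of the formula, we can compute the loss between the perfect prediction for an 8 and the random-looking prediction vector above (a sketch in Python with NumPy):

```python
import numpy as np

def mse_loss(target, output):
    """Sum of half the squared differences, matching the loss above."""
    return np.sum(0.5 * (target - output) ** 2)

target = np.zeros(10)
target[8] = 1.0  # the perfect prediction for a handwritten 8

# The initial, untrained prediction from earlier
output = np.array([0, 0.25, 0.05, 0, 0, 0, 0, 0.7, 0, 0])
print(mse_loss(target, output))  # 0.7775
```

Most of that loss comes from the missing 1 at index 8 and the spurious 0.7 at index 7; training will push both towards the target.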

We calculated the error the same way in the previous post when we were training our perceptron model. As you would expect, the first few training iterations will have a very high level of error, because the weights are chosen at random. Minimising the error is an optimisation problem – we need to adjust the weights inside our kernels to make the error as small as possible. Intuitively, we can picture the loss function as a surface (pictured below) and ask in which direction to change the weights to move the error value down the slope.

Error surface of a linear neuron with two input weights

To know which weights to change, and by how much, we calculate the derivative of the loss function with respect to the weights. The full derivation of a squared error function is here. This derivative lets us perform the backward pass, which tells us which weights contributed most to the error of the prediction. Finally, the weight update step changes the weights of the filters to move ‘down’ the slope of the loss function, scaled by a learning rate.

    \[w_{new} = w_{old} - \eta \frac{dL}{dW}\]

Where w_{new} is the new set of weights, w_{old} is the set of weights from the previous training iteration, \eta is the learning rate, and \frac{dL}{dW} is the derivative of the loss function with respect to the weights. We choose the learning rate ourselves – higher values make bigger changes to the weights and reduce the error faster, but may overshoot the optimal value; lower values take longer to reduce the error, but may settle closer to the optimum weights. The vanishing gradient problem can occur when the gradient of the error function with respect to the weights becomes ‘vanishingly small’, which prevents the weights from changing. This problem was addressed in 2011, when Glorot, Bordes and Bengio showed that using a rectifier, or ReLU, activation function (rather than the traditional logistic sigmoid or hyperbolic tangent) enables more efficient gradient propagation.
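
The update rule itself is simple enough to sketch in a few lines. Here it is applied to a toy one-dimensional loss L(w) = (w - 3)^2, so we can watch a single weight descend the slope (the loss and learning rate are invented for illustration):

```python
def gradient_descent_step(w, grad, learning_rate):
    """w_new = w_old - eta * dL/dw, the update rule above."""
    return w - learning_rate * grad

# Minimise L(w) = (w - 3)^2, whose derivative is 2 * (w - 3)
w = 0.0
eta = 0.1
for _ in range(50):
    grad = 2 * (w - 3)
    w = gradient_descent_step(w, grad, eta)

print(w)  # converges towards the optimum w = 3
```

With eta = 0.1 the weight approaches the minimum smoothly; a much larger learning rate would make each step overshoot and oscillate around w = 3, which is the trade-off described above.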

The four steps to update the model weights represent a single training iteration. We set the number of iterations for each set of training images, with the goal of having a final model that can accurately classify new images. 

There are a number of topics related to convolutional neural networks that were not covered by this post. However, the points above should give the reader a good high-level idea of how such a model is trained. In Part 2 we will implement a model that can read and classify handwritten digits, using Google TensorFlow.