
Building a deep neural network to read my handwriting with TensorFlow and Python

In the previous post, we looked at a high level overview of how deep neural networks are constructed and trained. Using a convolutional neural net and the TensorFlow library, we will now consider a classic visual classification problem – recognising handwritten digits.

In my rather bad handwriting, I’ve written out each digit

Neural nets need to be trained on lots and lots of labelled data to perform well. I could write out the digits above myself a few thousand times, but that would be a very boring exercise. It may be that the model ends up only being good at recognising my handwriting, and won’t generalise well to numbers written by other people.

Fortunately, we have the MNIST database, a collection of 60,000 training images and 10,000 testing images of handwritten digits, taken from American Census Bureau employees and American high school students.

A sample of the handwritten digits in the MNIST database

To translate each of these MNIST examples into a form that the computer can understand, we can represent the digit as an array of numbers that indicate how dark each pixel is.

import cv2

# read the image in grayscale (the 0 flag gives a single colour channel)
gray = cv2.imread('number_3.png', 0)

# resize to 28x28 and invert the colors so the digit is white on black
gray = cv2.resize(255-gray, (28, 28))

# print the matrix of numbers
print(gray)

# show the transformed image
cv2.imshow('image', gray)
cv2.waitKey(0)
cv2.destroyAllWindows()
Using the Python OpenCV library, we read in each 28×28 pixel image of my handwritten digits and convert it to an array. The flattened array gives us 784 numerical inputs.

In another previous post, we used a single-layer perceptron to train a binary classifier with 60 inputs. We could apply the same idea here, and train a model to recognise either a 3 or a ‘not 3’, based on lots of training examples. The problem with this approach is that the model will not be able to detect a 3 when the image isn’t perfectly centred in the 28×28 square. The pattern learned will only be for ‘perfect’ data, which is rare in the real world.

When the image is offset, a single layer neural network will struggle to accurately classify the digit

As well as digits written on a page, we eventually want to train a model to pick out objects, like a car or a bicycle. Objects can show up anywhere in an image, so the model needs to recognise the shape of the object of interest independently of its location. As mentioned in the last post, convolution lets us train the neural network by dividing up the image and having neurons fire when the receptive field passes over an ‘interesting’ part of the image.

Google TensorFlow and the power of graphics cards

TensorFlow is an open-source machine-learning library based on dataflow programming. Originally developed by the Google Brain team for internal use, the library is now used for both research and commercial applications across companies under Google’s parent company Alphabet. The name TensorFlow comes from the operations neural networks apply to multidimensional data arrays – also known as tensors. TensorFlow has a wide range of functionality, but is mainly designed for training deep neural network models.

A significant advantage of TensorFlow is its ability to take advantage of GPU hardware. Training a deep neural network involves performing matrix multiplications on the weights and inputs at each layer of the network. While these computations are individually simple, the sheer number required means that traditional CPU-based calculation is simply too slow. As an example, the VGG16 neural network has 16 weight layers (13 convolutional and 3 fully connected) and around 138 million weights and biases. A single training iteration would take far too long using CPUs optimised for single-threaded (not concurrent) computation.
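
To make this concrete, the forward pass of a single dense layer is one large matrix multiplication plus a bias. A minimal numpy sketch (the sizes here are illustrative, not VGG16’s):

import numpy as np

x = np.random.rand(100, 784)    # a batch of 100 flattened 28x28 images
W = np.random.rand(784, 1024)   # weights for a 1,024-neuron layer
b = np.random.rand(1024)        # biases

# one layer's forward pass: a (100 x 784) by (784 x 1024) matrix multiply
z = x @ W + b                   # output shape: (100, 1024)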

A graphics card is ideal for deep learning, as most graphics processing involves running lots of operations on large matrices

A good analogy for why a GPU is superior to a CPU for training a neural network is the choice between moving house with a Lamborghini or a moving truck. The Lamborghini (your CPU) is going to get your things there very quickly, but you can’t store much in the car, so lots of trips back and forth will be needed. A moving truck (the GPU) is slower per trip, but can load up a lot more and will most likely be the faster option for moving everything out of the house.

GPUs have hundreds (sometimes thousands) of simpler cores that matrix multiplications can be spread across. Harnessing the power of GPUs made training deep neural networks significantly quicker and more accessible to researchers. Graphics cards are much cheaper and easier to set up than dedicated, multi-CPU clusters. In 2007, Nvidia released CUDA, a parallel computing platform for writing general-purpose programs that run on GPUs. CUDA is one of the requirements to run TensorFlow with GPU support.

The TensorFlow documentation provides very clear instructions on how to install the library on various operating systems. I installed the library on a 64-bit desktop running Windows 10, using the Anaconda distribution with Python 3.5, CUDA® Toolkit 9.0, cuDNN v7.0 (which requires registering as an Nvidia developer) and a GTX 970 GPU with the latest drivers installed.

Building a convolutional neural net with TensorFlow

The TensorFlow documentation provides an excellent tutorial for building a convolutional neural network for handwriting recognition. To better understand the architecture of a deep neural network, I have expanded on some of the details of the tutorial that were not covered in the previous post.

The MNIST classifier tutorial in the TensorFlow documentation suggests the following neural net architecture for handwritten digit classification:

  1. Convolutional Layer #1: Applies 32 5×5 filters (extracting 5×5-pixel subregions), with ReLU activation function
  2. Pooling Layer #1: Performs max pooling with a 2×2 filter and stride of 2 (which specifies that pooled regions do not overlap)
  3. Convolutional Layer #2: Applies 64 5×5 filters, with ReLU activation function
  4. Pooling Layer #2: Again, performs max pooling with a 2×2 filter and stride of 2
  5. Dense Layer #1: 1,024 neurons, with dropout regularization rate of 0.4 (probability of 0.4 that any given element will be dropped during training)
  6. Dense Layer #2 (Logits Layer): 10 neurons, one for each digit target class (0–9).

A few of these terms are new, so we’ll take a look at them now.

ReLU activation function

Previously we looked at activation maps, which highlighted the parts of an image that were important for classifying it. In biologically inspired networks, the activation function is a mathematical representation of the rate of action potential firing in the cell. Given an input, the activation function of the neuron determines whether the neuron fires or ‘activates’. We need activation functions in neural networks to determine the output, e.g. 0 indicating an image is not a 3, or 1 indicating that it is.

There are many different types of activation functions, but for now we will focus on two – the sigmoid and the ReLU.

Source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

The sigmoid function has the following benefits:

  1. The output of the function lies between 0 and 1. If we want to predict a probability as an output, the sigmoid is ideal.
  2. We can find the derivative or slope of the sigmoid at any point. This is important when training our model, as the gradient of the activation function indicates which way the weights should be adjusted during backpropagation.
  3. The sigmoid is monotonic, which means it is either entirely non-increasing or entirely non-decreasing. During backpropagation, the influence of each neuron is calculated and the weights are adjusted. If the activation function isn’t monotonic, then increasing a neuron’s weight may cause it to have less influence – the opposite of the intended effect. This would result in chaotic behaviour during training, and the network would likely not converge to an accurate classifier.

The sigmoid was one of the first types of activation functions used in multi-layer neural networks. The main difficulty with training a neural network with backpropagation and sigmoid activation functions is the vanishing gradient problem. From Wikipedia:

In such methods, each of the neural network’s weights receives an update proportional to the gradient of the error function with respect to the current weight in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.

The rectifier or rectified linear unit (ReLU) was introduced by Hahnloser et al. in a 2000 paper in Nature. The ReLU is defined as R(z) = max(0, z), where z = Wx + b. It has the following features:

  1. Differentiable at every point except z = 0, so we know which way to adjust the weights during backpropagation.
  2. The gradient is constant for all inputs greater than 0, which means the ReLU does not suffer from the vanishing gradient problem. The constant gradient also results in faster learning.
  3. Rather than becoming ‘vanishingly small’ as with the sigmoid, the gradient of the ReLU is either 0 for z < 0 or 1 for z > 0. We can add as many layers to the network as we like, because multiplying the gradients together won’t cause the gradient to vanish.

As of 2018 the ReLU is the most popular activation function used for training deep neural networks.
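
To make the two activation functions concrete, here is a minimal numpy sketch of the sigmoid and the ReLU together with their gradients:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # never exceeds 0.25, so deep stacks shrink the gradient

def relu(z):
    return np.maximum(0.0, z)      # R(z) = max(0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # 0 for z < 0, 1 for z > 0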

Pooling Layer

The pooling layer is also known as the downsampling layer, and is applied after a convolutional layer. There are a number of different types of pooling layer, but the one we include in our model is max pooling. Similar to the convolution step, the pooling step takes a 2×2 filter and slides it over the input volume. The output at each position is the maximum value in the subregion the filter covers.

Source: https://medium.freecodecamp.org/understanding-capsule-networks-ais-alluring-new-architecture-bdb228173ddc

The convolution step extracts specific features from the original input volume (a curve, for example) and gives them a high activation value. The motivation behind the pooling layer is that the exact location of a feature matters less than its location relative to the other features identified. Pooling also reduces the spatial dimensions of the input volume, cutting down the number of parameters, weights and the overall computational cost. Without a pooling layer, the model may overfit to the exact dimensions of the training data and not generalise well. A small numerical sketch of max pooling follows below.
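
Here is that sketch in numpy (the feature map values are made up for illustration). A 2×2 filter with stride 2 keeps only the largest value in each non-overlapping block:

import numpy as np

# a toy 4x4 feature map
fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [1, 2, 8, 7],
                 [3, 1, 4, 9]])

# 2x2 max pooling with stride 2: group into non-overlapping
# 2x2 blocks and take the maximum of each block
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 5]
#  [3 9]]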

Building the MNIST classifier in TensorFlow

Input layer

TensorFlow makes it very easy to implement each of the layers of the deep neural network described above. We start with the input layer:

def cnn_model_fn(features, labels, mode):
  """Model function for CNN."""
  # Input Layer: reshape flat 784-pixel vectors into 28x28 single-channel images
  input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])

The reshape step is required as the convolutional and pooling layers need the inputs to be in a particular shape, as defined in the TensorFlow documentation.

Convolutional Layer 1

For the first convolution of the 28×28 pixel images from the MNIST dataset, we will apply a 5×5 filter with a ReLU activation function:

conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=32,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

The padding argument makes the output tensor have the same width and height as the input tensor. The filters parameter sets the number of different filters applied at this layer. In the last post, we only really considered a single filter, the curve detector. When training a real neural network, we want to detect different features at each layer: one filter on the first layer may detect curves, while another activates for straight lines in an image.

The output tensor will have the same 28×28 spatial dimensions as the input, but with 32 channels holding the outputs of the filters.

Pooling Layer 1

Taking the output of the first convolutional layer, we apply a pooling layer:

pool1 = tf.layers.max_pooling2d(
    inputs=conv1,
    pool_size=[2, 2],
    strides=2)

As described above, we use a pool size of 2×2. The strides parameter specifies how many pixels the filter ‘slides’ along the image each time. Setting the stride to 2 for a pool size of 2×2 means that the filter regions don’t overlap. The resulting tensor is now 14×14 pixels, with 32 channels.

Convolutional Layer 2 and Pooling Layer 2

Our neural network is getting deeper now. Similar to the previous layers, we apply another convolutional layer and pooling layer:

conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

pool2 = tf.layers.max_pooling2d(
    inputs=conv2,
    pool_size=[2, 2],
    strides=2)

In the second convolutional layer, there are 64 filters applied. Choosing the ‘correct’ number of filters for a deep neural network is still something of a dark art. The general approach is to pick powers of 2 to optimise GPU usage and to follow the patterns from other successful deep neural nets. The second convolution and pooling layers produce a 7×7 tensor with 64 channels.

Dense Layer

The dense layer connects to the flattened values of the feature map from the second pooling layer. It is made up of 1,024 neurons with ReLU activation functions. To help prevent overfitting, we add a dropout layer, which randomly excludes 40% of the neurons during training. The training argument to the dropout function ensures that dropout is only applied when the model is in training mode.

  # Dense Layer: flatten the 7x7x64 feature map into a single vector
  pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])

  dense = tf.layers.dense(
      inputs=pool2_flat,
      units=1024,
      activation=tf.nn.relu)

  dropout = tf.layers.dropout(
      inputs=dense,
      rate=0.4,
      training=mode == tf.estimator.ModeKeys.TRAIN)

Logits Layer and predictions

The final layer has 10 neurons, corresponding to the 10 possible digit classes (0–9). The predictions come in two forms: a single class prediction, taken as the highest raw value in the logits tensor, and a vector of probabilities that the input belongs to each class. The predictions are compiled in a dictionary, and an EstimatorSpec object is returned.

logits = tf.layers.dense(inputs=dropout, units=10)

predictions = {
    # single class prediction: the index of the largest raw logit
    "classes": tf.argmax(input=logits, axis=1),
    # per-class probabilities via softmax
    "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
}

if mode == tf.estimator.ModeKeys.PREDICT:
  return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

Loss calculation

Because we are dealing with a multi-class classification problem, we use cross entropy to measure how closely the model’s predictions match the target classes. Cross entropy requires the classification labels to be one-hot encoded, and fortunately the TensorFlow library makes it very easy to convert the labels. To calculate the loss, we apply the softmax function to the logits layer and calculate the cross entropy.

onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
loss = tf.losses.softmax_cross_entropy(
    onehot_labels=onehot_labels,
    logits=logits)

Now that we have a function for the loss, we want to minimise the value of the function during training. For the digit classification model, we use a learning rate of 0.001 and stochastic gradient descent as the optimisation algorithm.

if mode == tf.estimator.ModeKeys.TRAIN:
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
  train_op = optimizer.minimize(
      loss=loss,
      global_step=tf.train.get_global_step())
  return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

Finally, we want to be able to measure the accuracy of the model:

eval_metric_ops = {
    "accuracy": tf.metrics.accuracy(
        labels=labels,
        predictions=predictions["classes"])}
return tf.estimator.EstimatorSpec(
    mode=mode,
    loss=loss,
    eval_metric_ops=eval_metric_ops)

Load training and test data

The MNIST dataset has 60,000 training samples and 10,000 test samples of individual digits. We store the raw pixel values and the training labels in numpy arrays:

def main(unused_argv):
  # Load training and eval data
  mnist = tf.contrib.learn.datasets.load_dataset("mnist")
  train_data = mnist.train.images
  train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
  eval_data = mnist.test.images
  eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

TensorFlow Estimators

TensorFlow provides Estimator classes, designed for high-level model training and evaluation. Pre-made Estimators are subclasses of the tf.estimator.Estimator base class, while custom Estimators are instantiated directly from it:

Premade estimators are sub-classes of `Estimator`. Custom Estimators are usually (direct) instances of `Estimator`

We create our MNIST classifier with the Estimator class, and specify the model as the cnn_model_fn created earlier. The model_dir directory stores the data for the model, which includes checkpoints to continue training a previously saved model (useful if we want to train on more data).
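
A minimal sketch of that call, following the TensorFlow tutorial (the model_dir path is just an example):

mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir="/tmp/mnist_convnet_model")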

Training the MNIST classifier

Next we specify the training data, passing in the feature data (the pixels of each image) and the training labels (0-9). The batch size is set to 100 examples per training step, and the number of steps is set to 10,000. We set num_epochs to None, so the model will continue training until the number of steps has been reached. The train call also passes a logging hook, sketched below.
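
The logging_hook referenced in the training call isn’t defined above; a minimal sketch, following the pattern from the TensorFlow tutorial, logs the softmax probabilities every 50 training steps (via the "softmax_tensor" name we gave to tf.nn.softmax in the logits layer):

tensors_to_log = {"probabilities": "softmax_tensor"}
logging_hook = tf.train.LoggingTensorHook(
    tensors=tensors_to_log,
    every_n_iter=50)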

# Train the model
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_data},
    y=train_labels,
    batch_size=100,
    num_epochs=None,
    shuffle=True)
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=10000,
    hooks=[logging_hook])

Evaluating the classifier

After training, we call the evaluate method to measure the accuracy of the model on the test set.

# Evaluate the model and print results
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data},
    y=eval_labels,
    num_epochs=1,
    shuffle=False)
eval_results = mnist_classifier.evaluate(
   input_fn=eval_input_fn)
print(eval_results)

Running the model

To summarise, we have coded:

  1. Convolutional neural net architecture;
  2. Custom Estimator object with the CNN as the model function; and
  3. Training/evaluation logic.

The final output from training our neural net on the MNIST dataset:

INFO:tensorflow:Saving dict for global step 10000: accuracy = 0.9521, global_step = 10000, loss = 0.16323692
{'accuracy': 0.9521, 'loss': 0.16323692, 'global_step': 10000}

95.21% classification accuracy – let’s see how it performs on my own handwritten digits:

Running the classifier on new data

Using the image of my handwriting at the start of this post, I extracted each digit and saved them as individual .png files. When evaluating the performance of the MNIST classifier, the test images and labels were stored in numpy arrays:

eval_data = mnist.test.images
eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

We need to get the new data into this format to test the model. First we build an array (initially filled with zeros) to store the flattened pixel values of each image. There are only 10 examples, so the labels can be specified manually:

tc_digits = np.zeros((10,784))
tc_labels = np.array([0,1,2,3,4,5,6,7,8,9])

The next step is to open each image, resize it, invert the colors and convert the pixels to exclusively black or white. It is critically important that our testing data is in as similar a format as possible to the original MNIST training data.

for n in tc_labels:
    # read the image (the 0 argument specifies grayscale, as
    # we only want a single colour channel)
    gray = cv2.imread("tc_digit_" + str(n) + ".png", 0)
    # resize and invert colors of the digit
    gray = cv2.resize(255-gray, (28, 28))
    # convert the grayscale image to black and white
    (thresh, gray) = cv2.threshold(gray, 128, 255,
                                   cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # save the converted image for reference
    cv2.imwrite("tc_digit_" + str(n) + "_gray.png", gray)
    # flatten the 28x28 image into a 784-element vector, scaled to [0, 1]
    tc_digits[n] = gray.flatten() / 255.0

tc_digits = tc_digits.astype(dtype="float32")

Now we run the new data through the saved classifier:

mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir="/tmp/mnist_convnet_model/new")
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": tc_digits},
    y=tc_labels,
    num_epochs=1,
    shuffle=False)
predictions = mnist_classifier.predict(input_fn=eval_input_fn)

Let’s see what the class predictions and probabilities are for the new digits:

for class_predictions in predictions:
    print(class_predictions['classes'])
    list_class_predictions = class_predictions['probabilities'].tolist()
    list_class_predictions = ["%.2f" % prob for prob in list_class_predictions]
    print(list_class_predictions)
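
To arrive at the headline accuracy figure, we can compare each predicted class against the true labels (a small sketch reusing tc_labels and eval_input_fn from above):

predicted = [p['classes'] for p in mnist_classifier.predict(input_fn=eval_input_fn)]
accuracy = np.mean(np.array(predicted) == tc_labels)
print("accuracy: %.0f%%" % (accuracy * 100))   # 9 of 10 digits correct gives 90%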

90% accuracy! Not bad. Let’s have a look at the incorrect prediction for the ‘8’:

As it turns out, my 8s look a lot like 6s – which is exactly what the classifier predicted. The high level API in TensorFlow makes it very easy to construct a deep neural network. The full code for the construction, training and evaluation of the neural network above can be found here.