Building a Neural Network from Scratch: Part 2

In this post we’ll improve our training algorithm from the previous post. When we’re done we’ll be able to achieve 98% precision on the MNIST data set after just 9 epochs of training, which takes only about 30 seconds to run on my laptop.

For comparison, last time we only achieved 92% precision after 2,000 epochs of training, which took over an hour!

The main driver of this improvement is switching from batch gradient descent to mini-batch gradient descent. But we’ll also make two other, smaller improvements: we’ll add momentum to our descent algorithm, and we’ll smarten up the initialization of our network’s weights.

We’ll also reorganize our code a bit while we’re at it, making things more modular.

But first we need to import and massage our data. These steps are essentially the same as in the previous post:

from sklearn.datasets import fetch_openml
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# import (fetch_mldata has been removed from scikit-learn; fetch_openml is its replacement)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype('float64')  # targets come back as strings

# scale
X = X / 255

# one-hot encode labels
digits = 10
examples = y.shape[0]
y = y.reshape(1, examples)
Y_new = np.eye(digits)[y.astype('int32')]
Y_new = Y_new.T.reshape(digits, examples)

# split, reshape, shuffle
m = 60000
m_test = X.shape[0] - m
X_train, X_test = X[:m].T, X[m:].T
Y_train, Y_test = Y_new[:,:m], Y_new[:,m:]
shuffle_index = np.random.permutation(m)
X_train, Y_train = X_train[:, shuffle_index], Y_train[:, shuffle_index]

Then we’ll define our key functions. Only the last two are new, and they just put the steps of forward and backward propagation into their own functions. This tidies up the training code to follow, so that we can focus on the novel elements, especially mini-batch descent and momentum.

Notice that in the process we introduce three dictionaries: params, cache, and grads. These are for conveniently passing information back and forth between the forward and backward passes.

def sigmoid(z):
    s = 1. / (1. + np.exp(-z))
    return s

def compute_loss(Y, Y_hat):

    L_sum = np.sum(np.multiply(Y, np.log(Y_hat)))
    m = Y.shape[1]
    L = -(1./m) * L_sum

    return L

def feed_forward(X, params):

    cache = {}

    # hidden layer: affine transform followed by a sigmoid
    cache["Z1"] = np.matmul(params["W1"], X) + params["b1"]
    cache["A1"] = sigmoid(cache["Z1"])

    # output layer: affine transform followed by a softmax over the 10 digit classes
    cache["Z2"] = np.matmul(params["W2"], cache["A1"]) + params["b2"]
    cache["A2"] = np.exp(cache["Z2"]) / np.sum(np.exp(cache["Z2"]), axis=0)

    return cache

def back_propagate(X, Y, params, cache, m_batch):

    # output layer: with a softmax output and cross-entropy cost, the error is just A2 - Y
    dZ2 = cache["A2"] - Y
    dW2 = (1./m_batch) * np.matmul(dZ2, cache["A1"].T)
    db2 = (1./m_batch) * np.sum(dZ2, axis=1, keepdims=True)

    # hidden layer: backpropagate through W2, then through the sigmoid via its derivative A1 * (1 - A1)
    dA1 = np.matmul(params["W2"].T, dZ2)
    dZ1 = dA1 * cache["A1"] * (1 - cache["A1"])
    dW1 = (1./m_batch) * np.matmul(dZ1, X.T)
    db1 = (1./m_batch) * np.sum(dZ1, axis=1, keepdims=True)

    grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

    return grads
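
For reference, compute_loss is the cross-entropy cost $L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{10} y^{(i)}_k \log \hat{y}^{(i)}_k$, where $\hat{y}^{(i)}$ is the softmax output stored in cache["A2"] and $m$ is the number of examples being averaged over. And back_propagate starts from the handy fact that, with a softmax output layer and this cost, the error at the output is simply $\hat{Y} - Y$, which is why the line dZ2 = cache["A2"] - Y involves no derivative of the softmax at all.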

Now for the substantive stuff.

To switch to mini-batch descent, we add another for loop inside the pass through each epoch. At the start of each epoch we randomly shuffle the training set, then iterate through it in chunks of batch_size, which we’ll arbitrarily set to 128. We’ll see the code for all this in a moment.
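
With $m = 60000$ training examples and batch_size set to 128, each epoch has $\lceil 60000/128 \rceil = 469$ mini-batches: 468 full batches of 128 examples plus one final batch of 96. In the code below, the expression -(-m // batch_size) is just a compact way of computing this ceiling division.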

Next, to add momentum, we keep a moving average of our gradients. So instead of updating our parameters by doing e.g.:

params["W1"] = params["W1"] - learning_rate * grads["dW1"]

we do this:

V_dW1 = (beta * V_dW1 + (1. - beta) * grads["dW1"])
params["W1"] = params["W1"] - learning_rate * V_dW1
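
Here $V_{dW1}$ is an exponentially weighted moving average of the recent gradients. With $\beta = 0.9$, the value we’ll use below, each update effectively averages over roughly the last $1/(1-\beta) = 10$ mini-batch gradients, which smooths out the noise that comes from estimating the gradient on just 128 examples at a time.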

Finally, to smarten up our initialization, we shrink the variance of the weights in each layer. Following this nice video by Andrew Ng (whose excellent Coursera materials I’ve been relying on heavily in these posts), we’ll set the variance for each layer to $1/n$, where $n$ is the number of inputs feeding into that layer.

We’ve been using the np.random.randn() function to get our initial weights. And this function draws from the standard normal distribution. So to adjust the variance to $1/n$, we just divide by $\sqrt{n}$. In code this means that instead of doing e.g. np.random.randn(n_h, n_x), we do np.random.randn(n_h, n_x) * np.sqrt(1. / n_x).
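
The rough idea: if each of the $n$ inputs to a unit has variance around 1, and the weights have variance $1/n$, then the unit’s pre-activation $z = \sum_{i=1}^{n} w_i x_i$ also has variance around 1, no matter how large $n$ is. That keeps the sigmoid units away from their flat, saturated regions at the start of training, where gradients would be vanishingly small.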

Ok that covers our three improvements. Let’s build and train!

np.random.seed(138)

# hyperparameters
n_x = X_train.shape[0]
n_h = 64
learning_rate = 4
beta = .9
batch_size = 128
batches = -(-m // batch_size)  # ceiling division: the number of mini-batches per epoch

# initialization
params = { "W1": np.random.randn(n_h, n_x) * np.sqrt(1. / n_x),
           "b1": np.zeros((n_h, 1)),
           "W2": np.random.randn(digits, n_h) * np.sqrt(1. / n_h),
           "b2": np.zeros((digits, 1)) }

V_dW1 = np.zeros(params["W1"].shape)
V_db1 = np.zeros(params["b1"].shape)
V_dW2 = np.zeros(params["W2"].shape)
V_db2 = np.zeros(params["b2"].shape)

# train
for i in range(9):

    permutation = np.random.permutation(X_train.shape[1])
    X_train_shuffled = X_train[:, permutation]
    Y_train_shuffled = Y_train[:, permutation]

    for j in range(batches):

        begin = j * batch_size
        end = min(begin + batch_size, X_train.shape[1])
        X = X_train_shuffled[:, begin:end]
        Y = Y_train_shuffled[:, begin:end]
        m_batch = end - begin

        cache = feed_forward(X, params)
        grads = back_propagate(X, Y, params, cache, m_batch)

        V_dW1 = (beta * V_dW1 + (1. - beta) * grads["dW1"])
        V_db1 = (beta * V_db1 + (1. - beta) * grads["db1"])
        V_dW2 = (beta * V_dW2 + (1. - beta) * grads["dW2"])
        V_db2 = (beta * V_db2 + (1. - beta) * grads["db2"])

        params["W1"] = params["W1"] - learning_rate * V_dW1
        params["b1"] = params["b1"] - learning_rate * V_db1
        params["W2"] = params["W2"] - learning_rate * V_dW2
        params["b2"] = params["b2"] - learning_rate * V_db2

    cache = feed_forward(X_train, params)
    train_cost = compute_loss(Y_train, cache["A2"])
    cache = feed_forward(X_test, params)
    test_cost = compute_loss(Y_test, cache["A2"])
    print("Epoch {}: training cost = {}, test cost = {}".format(i + 1, train_cost, test_cost))

print("Done.")
Epoch 1: training cost = 0.15587093418167058, test cost = 0.16223940981168986
Epoch 2: training cost = 0.09417519634799829, test cost = 0.11032242938356147
Epoch 3: training cost = 0.07205872840102934, test cost = 0.0958078559246339
Epoch 4: training cost = 0.07008115814138867, test cost = 0.1010270024817398
Epoch 5: training cost = 0.05501929068580713, test cost = 0.09527116695490956
Epoch 6: training cost = 0.042663638371140164, test cost = 0.08268937190759178
Epoch 7: training cost = 0.03615501088752129, test cost = 0.08188431384719108
Epoch 8: training cost = 0.03610956910064329, test cost = 0.08675249924246693
Epoch 9: training cost = 0.027582647825206745, test cost = 0.08023754855128316
Done.

How’d we do?

cache = feed_forward(X_test, params)
predictions = np.argmax(cache["A2"], axis=0)
labels = np.argmax(Y_test, axis=0)

print(classification_report(predictions, labels))
             precision    recall  f1-score   support

          0       0.99      0.98      0.98       984
          1       0.99      0.99      0.99      1136
          2       0.98      0.98      0.98      1037
          3       0.97      0.98      0.97      1004
          4       0.97      0.98      0.98       970
          5       0.97      0.96      0.97       900
          6       0.97      0.99      0.98       942
          7       0.97      0.97      0.97      1026
          8       0.97      0.96      0.97       982
          9       0.97      0.96      0.97      1019

avg / total       0.98      0.98      0.98     10000

And there it is: 98% precision in just 9 epochs of training.
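
We also imported confusion_matrix alongside classification_report, and it’s handy if you want to see which digits the network still mixes up. For example:

print(confusion_matrix(labels, predictions))  # rows are true digits, columns are predicted digits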

 