Building a simple neural network in PyTorch
We will now walk through building a neural network from scratch in PyTorch. Here, we have a small .csv file containing several examples of images from the MNIST dataset, a collection of handwritten digits between 0 and 9 that we want to classify. The following is an example from the MNIST dataset, showing a handwritten digit 1:
These images are 28x28 in size: 784 pixels in total. Our dataset in train.csv consists of 1,000 of these images, with each consisting of 784 pixel values, as well as the correct classification of the digit (in this case, 1).
Loading the data
We will begin by loading the data, as follows:
- First, we need to load our training dataset, as follows:
import pandas as pd

# Read the training data and separate the labels from the pixel values
train = pd.read_csv("train.csv")
train_labels = train['label'].values

# Reshape each flat row of 784 pixels into a (1, 28, 28) image
train = train.drop("label", axis=1).values.reshape(len(train), 1, 28, 28)
Notice that we reshaped our input to (1000, 1, 28, 28), which is a tensor of 1,000 single-channel images, each consisting of 28x28 pixels.
- Next, we convert our training data and training labels into PyTorch tensors so they can be fed into the neural network:
import torch

X = torch.Tensor(train.astype(float))
y = torch.Tensor(train_labels).long()
Note the data types of these two tensors. A float tensor comprises 32-bit floating-point numbers, while a long tensor comprises 64-bit integers. Our X features must be floats so that PyTorch can compute gradients over them, while our labels must be integers because they represent discrete classes within this classification model (we're trying to predict the classes 0, 1, 2, and so on, so a prediction of 1.5 wouldn't make sense).
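As a quick sanity check (a minimal sketch, assuming the 1,000-image train.csv described earlier), we can confirm that our tensors have the shapes and data types we expect:

print(X.shape, X.dtype)  # torch.Size([1000, 1, 28, 28]) torch.float32
print(y.shape, y.dtype)  # torch.Size([1000]) torch.int64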
Building the classifier
Next, we can start to construct our actual neural network classifier:
import torch.nn as nn
import torch.nn.functional as F

class MNISTClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Four fully connected layers, each half the size of the last
        self.fc1 = nn.Linear(784, 392)
        self.fc2 = nn.Linear(392, 196)
        self.fc3 = nn.Linear(196, 98)
        self.fc4 = nn.Linear(98, 10)
We build our classifier as if we were building a normal class in Python, inheriting from nn.Module in PyTorch. Within our init method, we define each of the layers of our neural network. Here, we define fully connected linear layers of varying sizes.
Our first layer takes 784 inputs as this is the size of each of our images to classify (28x28). The output size of one layer must match the input size of the next, which means our first fully connected layer outputs 392 units and our second layer takes 392 units as input. This is repeated for each layer, with each having half the number of units of the one before, until we reach our final fully connected layer, which outputs 10 units: one for each class we wish to predict.
Our network now looks something like this:
Here, we can see that our final layer outputs 10 units. This is because we wish to predict whether each image is a digit between 0 and 9, which is 10 different possible classifications in total. Our output is a vector of length 10 and contains predictions for each of the 10 possible values of the image. When making a final classification, we take the digit classification that has the highest value as the model's final prediction. For example, for a given prediction, our model might predict the image is the digit 1 with a probability of 10%, the digit 2 with a probability of 10%, and the digit 3 with a probability of 80%. We would, therefore, take the digit 3 as the prediction, as it was predicted with the highest probability.
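To make this concrete, here is a minimal sketch (with made-up scores) of how the highest-scoring class would be selected from a 10-unit output vector:

import torch

# A hypothetical 10-unit output for one image
output = torch.tensor([0.00, 0.10, 0.10, 0.80, 0.00,
                       0.00, 0.00, 0.00, 0.00, 0.00])

# The index of the largest value is the predicted digit
prediction = torch.argmax(output).item()
print(prediction)  # 3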
Implementing dropout
Within the init method of our MNISTClassifier class, we also define a dropout layer in order to help regularize the network:
self.dropout = nn.Dropout(p=0.2)
Dropout is a way of regularizing our neural networks to prevent overfitting. On each training pass, for each node in a layer that has dropout applied, there is a probability (here, p = 0.2, or 20%) that the node will not be used in training/backpropagation. This makes our network more robust to overfitting, since no node is used in every iteration of the training process, and prevents our network from becoming too reliant on the predictions of specific nodes.
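This behavior is easy to see in isolation. The following minimal sketch shows that nn.Dropout zeroes roughly 20% of its inputs in training mode (rescaling the surviving values by 1/(1-p)), while passing everything through unchanged in evaluation mode:

import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.2)
x = torch.ones(10)

dropout.train()    # training mode: each value is zeroed with probability 0.2
print(dropout(x))  # e.g. tensor([1.25, 1.25, 0.00, 1.25, ...])

dropout.eval()     # evaluation mode: dropout is disabled
print(dropout(x))  # tensor([1., 1., 1., ...])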
Defining the forward pass
Next, we define the forward pass within our classifier:
def forward(self, x):
    # Flatten each 28x28 image into a 784-length vector
    x = x.view(x.shape[0], -1)
    x = self.dropout(F.relu(self.fc1(x)))
    x = self.dropout(F.relu(self.fc2(x)))
    x = self.dropout(F.relu(self.fc3(x)))
    x = F.log_softmax(self.fc4(x), dim=1)
    return x
The forward() method within our classifier is where we apply our activation functions and define where dropout is applied within our network. Our forward method defines the path our input will take through the network. It first takes our input, x, and reshapes it for use within the network, flattening each 28x28 image into a one-dimensional vector of 784 values. We then pass it through our first fully connected layer, wrap it in a ReLU activation function to make it non-linear, and apply the dropout defined in our init method. We repeat this process for all the other layers in the network, returning the result at the end.
For our final prediction layer, we wrap it in a log softmax layer. We will use this to easily calculate our loss function, as we will see next.
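As a quick check (a sketch assuming the class above is fully defined; check_model is a temporary instance created just for this test), we can confirm that a batch flows through the network and produces one log probability per class:

import torch

check_model = MNISTClassifier()
check_model.eval()  # disable dropout so the check is deterministic

dummy = torch.randn(4, 1, 28, 28)  # a dummy batch of 4 images
out = check_model(dummy)

print(out.shape)                # torch.Size([4, 10])
print(torch.exp(out[0]).sum())  # ~1.0: exponentiated log_softmax outputs sum to 1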
Setting the model parameters
Next, we define our model parameters:
import torch.optim as optim

model = MNISTClassifier()
loss_function = nn.NLLLoss()

# Adam optimizer with a learning rate of 0.001
opt = optim.Adam(model.parameters(), lr=0.001)
We initialize an instance of our MNISTClassifier class as a model. We also define our loss as a Negative Log Likelihood Loss:
Let's assume our image is of the number 7. If we predict class 7 with probability 1, our loss will be -log(1) = 0, but if we predict class 7 with a probability of only 0.7, our loss will be -log(0.7) ≈ 0.36. This means that our loss approaches infinity as the predicted probability of the correct class approaches zero:
This is then summed over all the examples in our dataset to compute the total loss. Note that we applied a log softmax when building the classifier, which applies a softmax function (restricting the predicted outputs to be between 0 and 1 and to sum to 1) and then takes the log. This means that the log probabilities are already calculated, so all we need to do to compute the total loss is pick out each example's output at the correct class, negate it, and sum.
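To make this concrete, here is a minimal sketch (with made-up scores for two examples and three classes) showing that nn.NLLLoss simply picks out each example's log probability at the true class, negates it, and averages over the batch by default:

import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 1.5, 0.3]])
log_probs = F.log_softmax(scores, dim=1)
targets = torch.tensor([0, 1])

loss = nn.NLLLoss()(log_probs, targets)
manual = -(log_probs[0, 0] + log_probs[1, 1]) / 2
print(loss, manual)  # the two values match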
We will also define our optimizer as an Adam optimizer. An optimizer controls how the model's parameters are updated during training. The learning rate of a model defines how big the parameter updates are during each epoch of training: the larger the learning rate, the larger the parameter updates during gradient descent. Adam adapts the effective learning rate dynamically, so that when a model is initialized, the parameter updates are large; as the model learns and moves closer to the point where loss is minimized, the updates become smaller, and the local minimum can be located more precisely.
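While Adam manages its adaptive behavior internally, the base learning rate we passed in is stored on the optimizer, where it can be inspected or adjusted; a quick sketch:

print(opt.param_groups[0]['lr'])  # 0.001

# The base learning rate can also be lowered mid-training if desired
opt.param_groups[0]['lr'] = 0.0005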
Training our network
Finally, we can actually start training our network:
- First, create a loop that runs once for each epoch of our training. Here, we will run our training loop for 50 epochs. Within each epoch, we first take our input tensor of images and our output tensor of labels and transform them into PyTorch variables. A variable is a PyTorch object that contains a backward() method that we can use to perform backpropagation through our network (in recent versions of PyTorch, Variable is deprecated and plain tensors support backward() directly; the complete loop, assembled from these steps, is shown after the list):
from torch.autograd import Variable

for epoch in range(50):
    images = Variable(X)
    labels = Variable(y)
- Next, we call zero_grad() on our optimizer to set our calculated gradients to zero. Within PyTorch, gradients are accumulated on each backward pass. While this is useful in some models, such as when training RNNs, for our example we wish to calculate the gradients from scratch on each epoch, so we make sure to reset them to zero before each backward pass:
    opt.zero_grad()
- Next, we use our model's current state to make predictions on our dataset. This is effectively our forward pass; we then use these predictions to calculate our loss:
    outputs = model(images)
- Using the outputs and the true labels of our dataset, we calculate the total loss of our model using the defined loss function, which in this case is the negative log likelihood. On calculating this loss, we can then make a backward() call to backpropagate our loss through the network. We then call step() on our optimizer in order to update our model parameters accordingly:
    loss = loss_function(outputs, labels)
    loss.backward()
    opt.step()
- Finally, after each epoch is complete, we print the total loss. We can observe this to make sure our model is learning:
    print('Epoch [%d/%d] Loss: %.4f' % (epoch + 1, 50, loss.item()))
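Putting these steps together, the complete training loop looks like the following (a minimal sketch assuming X, y, model, loss_function, and opt are defined as above; we use plain tensors, which recent versions of PyTorch support directly):

model.train()  # ensure dropout is active during training

for epoch in range(50):
    opt.zero_grad()                   # reset accumulated gradients
    outputs = model(X)                # forward pass
    loss = loss_function(outputs, y)  # negative log likelihood loss
    loss.backward()                   # backpropagate
    opt.step()                        # update the model parameters
    print('Epoch [%d/%d] Loss: %.4f' % (epoch + 1, 50, loss.item()))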
In general, we would expect the loss to decrease after every epoch. Our output will look something like this:
Making predictions
Now that our model has been trained, we can use this to make predictions on unseen data. We begin by reading in our test set of data (which was not used to train our model):
# Load and reshape the test set in the same way as the training set
test = pd.read_csv("test.csv")
test_labels = test['label'].values
test = test.drop("label", axis=1).values.reshape(len(test), 1, 28, 28)

X_test = torch.Tensor(test.astype(float))
y_test = torch.Tensor(test_labels).long()
Here, we perform the same steps we performed when loading our training set of data: we reshape our test data and transform it into PyTorch tensors. Next, to predict using our trained model, we put the model into evaluation mode (so that dropout is disabled) and pass our test data through it:
model.eval()  # switch off dropout for prediction
with torch.no_grad():
    preds = model(X_test)
In the same way that we calculated our outputs on the forward pass of our training data in our model, we now pass our test data through the model and obtain predictions. We can view the predictions for one of the images like so:
print(preds[0])
This results in the following output:
Here, we can see that our prediction is a vector of length 10, with a prediction for each of the possible classes (digits between 0 and 9). The one with the highest predicted value is the one our model chooses as its prediction. In this case, it is the 10th unit of our vector, which equates to the digit 9. Note that since we used log softmax earlier, our predictions are log probabilities and not raw probabilities. To convert these back into probabilities, we can simply exponentiate them with torch.exp().
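For example, a minimal sketch of the conversion for the first test image:

probs = torch.exp(preds[0])  # convert log probabilities into probabilities
print(probs)        # 10 values between 0 and 1
print(probs.sum())  # ~1.0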
We can now construct a summary DataFrame containing our true test data labels, as well as the labels our model predicted:
# torch.max returns (values, indices); the indices are the predicted digits
_, predictionlabel = torch.max(preds.data, 1)
predictionlabel = predictionlabel.tolist()

# Build a side-by-side table of predicted and true labels
predictionlabel = pd.Series(predictionlabel)
test_labels = pd.Series(test_labels)
pred_table = pd.concat([predictionlabel, test_labels], axis=1)
pred_table.columns = ['Predicted Value', 'True Value']
display(pred_table.head())
This results in the following output:
Note how the torch.max() function returns both the highest value and its index; we take the index (the second return value) so that the class with the highest predicted value is automatically selected. We can see here that, based on a small selection of our data, our model appears to be making some good predictions!
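As a minimal sketch of what torch.max() returns along a dimension (with made-up values):

import torch

batch = torch.tensor([[0.1, 0.7, 0.2],
                      [0.8, 0.1, 0.1]])

values, indices = torch.max(batch, 1)
print(values)   # tensor([0.7000, 0.8000])
print(indices)  # tensor([1, 0])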
Evaluating our model
Now that we have some predictions from our model, we can use these predictions to evaluate how good our model is. One rudimentary way of evaluating model performance is accuracy, as discussed in the previous chapter. Here, we simply calculate our correct predictions (where the predicted image label is equal to the actual image label) as a percentage of the total number of predictions our model made:
# Accuracy: correct predictions as a percentage of all predictions
total = len(predictionlabel)
correct = len([1 for x, y in zip(predictionlabel, test_labels) if x == y])
print((correct / total) * 100)
This results in the following output:
Congratulations! Your first neural network was able to correctly identify almost 90% of unseen digit images. As we progress, we will see that there are more sophisticated models that may lead to improved performance. However, for now, we have demonstrated that creating a simple deep neural network in PyTorch is straightforward: it can be coded up in just a few lines and leads to performance above and beyond what is possible with basic machine learning models such as regression.