Image classification using deep learning
The most important step in solving any real-world problem is to get the data. Kaggle hosts a huge number of competitions on different data science problems. We will pick a problem that arose in 2014 and use it to test our deep learning algorithms in this chapter; we will then improve on it in Chapter 5, Deep Learning for Computer Vision, which covers Convolutional Neural Networks (CNNs) and some of the advanced techniques we can use to improve the performance of our image recognition models. You can download the data from https://www.kaggle.com/c/dogs-vs-cats/data. The dataset contains 25,000 images of dogs and cats. Preprocessing the data and creating train, validation, and test splits are important steps that need to be performed before we can implement an algorithm. Once the data is downloaded, a look at it shows that the folder contains images in the following format:
Most of the frameworks make it easier to read the images and tag them to their labels when provided in the following format. That means that each class should have a separate folder of its images. Here, all cat images should be in the cat folder and dog images in the dog folder:
Python makes it easy to put the data into the right format. Let's quickly take a look at the code and, then, we will go through the important parts of it:
import os
from glob import glob
import numpy as np

path = '../chapter3/dogsandcats/'

#Read all the files inside our folder.
files = glob(os.path.join(path,'*/*.jpg'))
print(f'Total no of images {len(files)}')
no_of_images = len(files)

#Create a shuffled index which can be used to create a validation dataset.
shuffle = np.random.permutation(no_of_images)

#Create a validation directory for holding validation images.
os.mkdir(os.path.join(path,'valid'))

#Create directories with label names inside train and valid.
for t in ['train','valid']:
    for folder in ['dog/','cat/']:
        os.mkdir(os.path.join(path,t,folder))

#Move a small subset of images into the validation folder.
for i in shuffle[:2000]:
    #Filenames look like cat.123.jpg, so the prefix is the label.
    folder = files[i].split('/')[-1].split('.')[0]
    image = files[i].split('/')[-1]
    os.rename(files[i],os.path.join(path,'valid',folder,image))

#Move the remaining images into the training folder.
for i in shuffle[2000:]:
    folder = files[i].split('/')[-1].split('.')[0]
    image = files[i].split('/')[-1]
    os.rename(files[i],os.path.join(path,'train',folder,image))
All the preceding code does is retrieve all the files and move 2,000 of the images into a validation set, segregating all the images into the two categories of cats and dogs. Creating a separate validation set is a common and important practice, as it is not fair to test our algorithms on the same data they were trained on. To create the validation dataset, we build a list of numbers in the range of the number of images, in shuffled order. The shuffled numbers act as indices for picking a bunch of images to create our validation dataset. Let's go through each section of the code in detail.
We retrieve the filenames using the following code:
files = glob(os.path.join(path,'*/*.jpg'))
The glob method returns all the files in the particular path. When there are a huge number of images, we can also use iglob, which returns an iterator, instead of loading the names into memory. In our case, we have only 25,000 filenames, which can easily fit into memory.
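To make the difference concrete, here is a small, self-contained sketch comparing glob and iglob; it builds a tiny throwaway directory tree (a hypothetical stand-in for the real dataset path) so it can run anywhere:

```python
import os
import tempfile
from glob import glob, iglob

# Build a tiny throwaway directory tree to demonstrate on
# (the real dogsandcats path from the text works the same way).
root = tempfile.mkdtemp()
for cls in ('cat', 'dog'):
    os.mkdir(os.path.join(root, cls))
    for i in range(3):
        open(os.path.join(root, cls, f'{cls}.{i}.jpg'), 'w').close()

pattern = os.path.join(root, '*/*.jpg')

# glob materialises the full list of matches in memory at once.
eager = glob(pattern)

# iglob yields matches lazily, one at a time - useful when the
# directories hold far more filenames than we want to keep in RAM.
lazy = list(iglob(pattern))

print(len(eager))                     # 6
print(sorted(eager) == sorted(lazy))  # True - same matches either way
```

Both calls find the same files; the only difference is whether the names arrive as one list or as an iterator.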
We can shuffle our files using the following code:
shuffle = np.random.permutation(no_of_images)
The preceding code returns 25,000 numbers in the range 0 to 24,999 in a shuffled order, which we will use as an index for selecting a subset of images to create a validation dataset.
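A quick sketch (with a small stand-in for the 25,000 images, and a fixed seed purely for reproducibility here) shows why a permutation is a safe way to split: every index appears exactly once, so the head and tail of the shuffled array can never overlap:

```python
import numpy as np

np.random.seed(0)   # fixed seed so this demonstration is reproducible

no_of_images = 10   # small stand-in for the 25,000 images
shuffle = np.random.permutation(no_of_images)

# Every index 0..no_of_images-1 appears exactly once, in random order.
print(sorted(shuffle) == list(range(no_of_images)))  # True

# The first k entries are a random, non-repeating sample of indices,
# which is what makes them usable for carving out a validation split.
valid_idx, train_idx = shuffle[:3], shuffle[3:]
print(len(set(valid_idx) & set(train_idx)))  # 0 - no overlap
```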
We can create the validation folder and the category folders, as follows:
os.mkdir(os.path.join(path,'valid'))
for t in ['train','valid']:
    for folder in ['dog/','cat/']:
        os.mkdir(os.path.join(path,t,folder))
The preceding code creates a validation folder, then creates folders for each category (cats and dogs) inside both the train and valid directories.
We can pick images using the shuffled index, as follows:
for i in shuffle[:2000]:
    folder = files[i].split('/')[-1].split('.')[0]
    image = files[i].split('/')[-1]
    os.rename(files[i],os.path.join(path,'valid',folder,image))
In the preceding code, we use our shuffled index to randomly pick 2,000 different images for our validation set. We do something similar with the remaining indices to segregate the training images in the train directory.
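After the move it is worth sanity-checking that nothing was lost or duplicated. Here is a small hypothetical helper (not part of the chapter's code) that counts the images per split and label; on the real dataset the counts should sum back to 25,000, with 2,000 in the valid folders:

```python
import os

def split_counts(path, splits=('train', 'valid'), labels=('dog', 'cat')):
    """Count images per split/label folder - a sanity check that the
    moves in the preceding code left nothing behind or duplicated."""
    return {(s, l): len(os.listdir(os.path.join(path, s, l)))
            for s in splits for l in labels}

# Usage on the real dataset (valid folders should hold 2,000 in total,
# and all four counts should sum to 25,000):
#   counts = split_counts('../chapter3/dogsandcats/')
#   print(counts, sum(counts.values()))
```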
As we have the data in the format we need, let's quickly look at how to load the images as PyTorch tensors.