AlexNet for Dog Breed Classification
This is a write-up of a lab presentation I gave at Deep Learning IndabaX Nigeria, held on May 11th, 2019. The lab was on implementing a Convolutional Neural Network.
For this lab, I implemented the AlexNet paper in order to teach participants how to read through a paper, and how to implement it using the Keras Sequential API in TensorFlow 2.0.
The paper presented a model that classified the 1.2 million images of the 2010 ImageNet contest into 1,000 categories. Our goal in this lab was to build a model to classify dog images into 120 breed categories. The training dataset for this task contains 10,222 images and is available on Kaggle.
The notebook implementing this is available in a GitHub repository hosted here. Please clone it and follow along.
The first thing you will want to do is download the training data for this challenge; the Kaggle link is provided above. You will also need the CSV file that lists each image along with its corresponding breed.
The lab is split into two sections. The first section is dedicated to a data pipeline. The second is dedicated to implementing the model.
We begin by installing TensorFlow 2.0. At the time of writing this article, it is in alpha release. We make use of the GPU version in order to speed up training. This is done using the following command.
!pip install tensorflow-gpu==2.0.0-alpha0
We then go ahead with our imports.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
import pandas as pd
import numpy as np
We need TensorFlow for a number of operations. We also import layers, models, and optimizers from tf.keras so that we can create a sequential model and add the various layers that we need.
We proceed to declare some constants. These are the height and width of the images we will work with inside our model, the number of channels in our images, as well as the batch size.
HEIGHT = 224
WIDTH = 224
NUM_CHANNELS = 3
BATCH_SIZE = 128
Next, we use pandas to read the file called labels.csv into the variable data_files. This produces a DataFrame with two columns, id and breed. The first column, id, is the name of the image file in the training dataset; breed is the categorical name of the dog breed.
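Reading the file looks something like the snippet below. This is a minimal sketch: it assumes labels.csv sits in the current working directory, while the notebook may read it from elsewhere.

data_files = pd.read_csv('labels.csv')  # columns: id, breed
data_files.head()  # quick sanity check of the first few rows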
We then proceed to extract the unique dog breeds from our file. This will give us a list that we can use to go from category names to numbers. We also use the opportunity to get the length of the list, which is the number of categories in our dataset.
dog_breeds = data_files['breed'].unique()
NCLASSES = len(dog_breeds)
We then proceed to engineer a new feature, which is a numeric encoding of the categories.
data_files['breed_id'] = data_files['breed'].apply(lambda x: np.where(dog_breeds == x)[0][0])
We also engineer a new feature that gives us the absolute path to the image files.
data_files['img'] = data_files['id'].apply(lambda x: 'gs://johnthas-dog-breeds/train/{}.jpg'.format(x))
A cursory inspection of the files shows that after reading an image file, we need to convert it to a tensor using decode_jpeg. This leads to the image pre-processing functions load_and_preprocess_image and preprocess_image. We then create a dataset from our data, shuffle it, and place it into batches of 128.
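Those functions and the dataset creation might look like the sketch below. The function names match the ones mentioned above, but the resizing and scaling details and the shuffle buffer size are my assumptions; the notebook is the authoritative version.

def preprocess_image(image):
    # Decode the raw JPEG bytes into a uint8 tensor, resize to the
    # model's input dimensions, and scale pixel values to [0, 1].
    image = tf.image.decode_jpeg(image, channels=NUM_CHANNELS)
    image = tf.image.resize(image, [HEIGHT, WIDTH])
    return image / 255.0

def load_and_preprocess_image(path):
    # Read the file from its path (local or GCS) and preprocess it.
    return preprocess_image(tf.io.read_file(path))

ds = tf.data.Dataset.from_tensor_slices(
    (data_files['img'].values, data_files['breed_id'].values))
ds = ds.map(lambda path, label: (load_and_preprocess_image(path), label))
ds = ds.shuffle(buffer_size=1000).batch(BATCH_SIZE)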
At this point, we have our dataset ds, and we can proceed to build our model.
The model itself is straightforward once you understand the architecture, which is defined in the paper.
There are five convolutional layers. The first two layers have something called local response normalization applied to them, followed by max pooling. I didn’t have much time to research local response normalization, so I used batch normalization instead. The remaining three convolutional layers are stacked together, followed by a max pooling layer.
I implemented the convolutional layers as Conv2D. There is a 1D variant (for temporal data) as well as a 3D variant (for videos). The paper also specifies the kernel size and stride for each layer, and these were implemented as well.
The architecture also features three fully connected layers, implemented as Dense in our code. The first two have 50% dropout, implemented as Dropout in our notebook.
All layers, except for the output, use relu activation. Weight initialization, specified through kernel_initializer, was done using a Gaussian distribution with mean 0 and standard deviation 0.01. I used RandomNormal for that.
Bias initialization was a constant, either 0 or 1, depending on the layer.
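Putting all of that together, the model can be assembled as in the sketch below. The filter counts, kernel sizes, strides, and pooling come from the paper; the padding choices are my assumption, and the exact notebook code may differ in detail.

# Gaussian weight initializer described in the paper: mean 0, stddev 0.01.
init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)

model = models.Sequential([
    # Conv1: 96 filters of 11x11 with stride 4. Batch normalization
    # stands in for the paper's local response normalization.
    layers.Conv2D(96, (11, 11), strides=4, activation='relu',
                  kernel_initializer=init,
                  input_shape=(HEIGHT, WIDTH, NUM_CHANNELS)),
    layers.BatchNormalization(),
    layers.MaxPooling2D((3, 3), strides=2),
    # Conv2: 256 filters of 5x5, bias initialized to 1 as in the paper.
    layers.Conv2D(256, (5, 5), padding='same', activation='relu',
                  kernel_initializer=init, bias_initializer='ones'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((3, 3), strides=2),
    # Conv3-Conv5: three stacked 3x3 convolutions, then max pooling.
    layers.Conv2D(384, (3, 3), padding='same', activation='relu',
                  kernel_initializer=init),
    layers.Conv2D(384, (3, 3), padding='same', activation='relu',
                  kernel_initializer=init, bias_initializer='ones'),
    layers.Conv2D(256, (3, 3), padding='same', activation='relu',
                  kernel_initializer=init, bias_initializer='ones'),
    layers.MaxPooling2D((3, 3), strides=2),
    layers.Flatten(),
    # Two fully connected layers of 4096 units, each with 50% dropout.
    layers.Dense(4096, activation='relu', kernel_initializer=init,
                 bias_initializer='ones'),
    layers.Dropout(0.5),
    layers.Dense(4096, activation='relu', kernel_initializer=init,
                 bias_initializer='ones'),
    layers.Dropout(0.5),
    # Output layer: 120 dog breeds instead of the paper's 1,000 classes.
    layers.Dense(NCLASSES, activation='softmax', kernel_initializer=init),
])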
The optimizer was Stochastic Gradient Descent, with a momentum of 0.9 and a weight decay of 5e-4.
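Compiling and training might look like the sketch below. Keras's SGD in TF 2.0 does not take a weight-decay argument, so the decay keyword here applies learning-rate decay, which is only an approximation of the paper's scheme; the learning rate and epoch count are likewise my assumptions.

model.compile(
    optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9, decay=5e-4),
    # Sparse categorical crossentropy, since the breed_id labels are integers.
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

model.fit(ds, epochs=10)  # epoch count chosen arbitrarily for illustration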
In the paper, image augmentation is discussed as another strategy for reducing overfitting in the model, alongside dropout. You can add functions for data augmentation to the preprocessing function, as sketched below. I will cover this in detail in another article.
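As a taste, a random horizontal flip, one of the augmentations the paper uses, could be slotted into preprocess_image like this. This is a sketch, not the notebook's code.

def preprocess_image(image):
    image = tf.image.decode_jpeg(image, channels=NUM_CHANNELS)
    image = tf.image.resize(image, [HEIGHT, WIDTH])
    # Randomly mirror the image left-to-right during training.
    image = tf.image.random_flip_left_right(image)
    return image / 255.0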
If you follow along, you will notice that our model has 38 million parameters, as against the 60 million in the paper. That is in part because our output layer has only 120 neurons, as against 1,000 in the paper.
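You can verify the parameter count yourself by printing the model summary.

model.summary()  # prints each layer's output shape and the total parameter count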
Please go ahead and try your hand at the notebook.