Knowledge Distillation for Traffic Sign Recognition
This is the second part of the post ResNet for Traffic Sign Classification With PyTorch, in which we will try to get close to the results of the previous model, but with a smaller one. To do so we will apply the knowledge distillation technique. Want to know more? Let's dive in!
- Introduction
- Imports
- Download the datasets
- Prepare the data
- Knowledge Distillation
- Training
- Test
- Conclusion
Introduction
It seems that the current trend in Deep Learning is to have bigger and bigger models. This makes it difficult for users to run them, fit them into small devices, get fast results, etc. This is why I see model compression, knowledge distillation, and similar techniques as some of the most interesting and useful topics in deep learning. After all, if you want to apply deep learning in VR/AR you need to fit models into small devices. For simulations and computer graphics it is also better to have optimized and fast models. In addition, this makes AI more affordable for everyone, democratizing access to the technology.
With this objective in mind I tried to get similar results to my previous post, but using a resnet18 instead of a resnet34. Before continuing with this post I encourage you to take a look at the previous one. Nevertheless, I was not successful at all. Looking for solutions I discovered FasterAI by Nathan Hubens, an awesome library that implements these compression techniques on top of Fastai.
So my objectives with this notebook are:
- Try to implement the Knowledge Distillation technique using FasterAI, to start getting familiar with the library. I would like to apply this kind of model compression to other projects too (but that will not be covered in this notebook).
- Encourage anyone who is thinking about implementing these types of approaches, which may seem complicated, to try this library.
Imports
First we will upgrade the fastai version used in Colab; by default it comes with the first version.
! pip install -Uqq fastai # upgrade fastai on colab
import fastai
fastai.__version__
from fastai.imports import *
from fastai.basics import *
from fastai.vision.all import *
from fastai.callback.all import *
from fastai.data.all import *
from fastai.vision.core import PILImage
from PIL import ImageOps
import matplotlib.pyplot as plt
import csv
from collections import defaultdict, namedtuple
import os
import shutil
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score
import torch
We will use the following variable to tell PyTorch to allocate tensors on the GPU if we have access to one. If not, all the processing will be done on the CPU (NOT recommended, it will be very slow).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Download the datasets
This will be analogous to the Download section in the previous post. However, I have added the possibility to train on only some classes of the dataset. Thanks to Knowledge Distillation we should still be able to classify signs that have never been seen in the training set (or at least give an informed guess). For this notebook we will use all the classes, but you are encouraged to remove some and see how it behaves!
c=list(range(43)) # List the classes you want to train on
# Download and unpack the training set and the test set
! wget https://sid.erda.dk/public/archives/daaeac0d7ce1152aea9b61d9f1e19370/GTSRB_Final_Training_Images.zip -P data
! wget https://sid.erda.dk/public/archives/daaeac0d7ce1152aea9b61d9f1e19370/GTSRB_Final_Test_Images.zip -P data
! wget https://sid.erda.dk/public/archives/daaeac0d7ce1152aea9b61d9f1e19370/GTSRB_Final_Test_GT.zip -P data
! unzip data/GTSRB_Final_Training_Images.zip -d data
! unzip data/GTSRB_Final_Test_Images.zip -d data
! unzip data/GTSRB_Final_Test_GT.zip -d data
# Move the test set to data/test
! mkdir data/test
! mv data/GTSRB/Final_Test/Images/*.ppm data/test
# Download class names
! wget https://raw.githubusercontent.com/georgesung/traffic_sign_classification_german/master/signnames.csv -P data
The following functions and code are used to organize and divide our data into training, validation and test sets. They are analogous to the process in the resnet34 notebook, with some minor changes to handle the cases where not all the classes are selected. A more detailed explanation can be found in the previous post.
We have changed the read_annotations function to allow us to filter and read only the annotations of the classes we are interested in. This is done by adding the if statement inside the for loop, so the classes have to be passed as an argument to all the following functions (see the usage sketch after the function below).
Annotation = namedtuple('Annotation', ['filename', 'label'])
def read_annotations(filename, classes=None):
annotations = []
with open(filename) as f:
reader = csv.reader(f, delimiter=';')
next(reader) # skip header
# loop over all images in current annotations file
for row in reader:
filename = row[0] # filename is in the 0th column
label = int(row[7]) # label is in the 7th column
if classes is None or label in classes: # We only read the annotations of the classes we are interested
annotations.append(Annotation(filename, label))
return annotations
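As a quick sanity check, here is a hedged usage sketch of the filtered read; the path follows the GTSRB folder layout used by load_training_annotations below, so adjust it to wherever you unpacked the data:
# Hedged example: keep only the annotations whose label is in [1, 2]
sample = read_annotations('data/GTSRB/Final_Training/Images/00001/GT-00001.csv', classes=[1, 2])
print(len(sample), sample[0] if sample else 'no matching annotations')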
def load_training_annotations(source_path, classes=None):
annotations = []
for c in range(0,43):
filename = os.path.join(source_path, format(c, '05d'), 'GT-' + format(c, '05d') + '.csv')
annotations.extend(read_annotations(filename, classes))
return annotations
def copy_files(label, filenames, source, destination, classes=None, move=False):
func = os.rename if move else shutil.copyfile
label_path = os.path.join(destination, str(label))
if not os.path.exists(label_path):
os.makedirs(label_path)
for filename in filenames:
if classes is None or int(os.path.basename(label_path)) in classes:
destination_path = os.path.join(label_path, filename)
if not os.path.exists(destination_path):
func(os.path.join(source, format(label, '05d'), filename), destination_path)
def split_train_validation_sets(source_path, train_path, validation_path, all_path, classes, validation_fraction=0.2):
"""
Splits the GTSRB training set into training and validation sets.
"""
if not os.path.exists(train_path):
os.makedirs(train_path)
if not os.path.exists(validation_path):
os.makedirs(validation_path)
if not os.path.exists(all_path):
os.makedirs(all_path)
annotations = load_training_annotations(source_path)
filenames = defaultdict(list)
for annotation in annotations:
filenames[annotation.label].append(annotation.filename)
for label, filenames in filenames.items():
filenames = sorted(filenames)
validation_size =int(len(filenames) // 30 * validation_fraction) * 30
train_filenames = filenames[validation_size:]
validation_filenames = filenames[:validation_size]
copy_files(label, filenames, source_path, all_path, classes, move=False)
copy_files(label, train_filenames, source_path, train_path, classes, move=True)
copy_files(label, validation_filenames, source_path, validation_path, classes, move=True)
Due to the changes we have made in the previous functions, we only have to add the information about the classes in the last step, passing it to the split_train_validation_sets function.
path = 'data'
source_path = os.path.join(path, 'GTSRB/Final_Training/Images')
train_path = os.path.join(path, 'train')
validation_path = os.path.join(path, 'valid')
all_path = os.path.join(path, 'all')
validation_fraction = 0.2
split_train_validation_sets(source_path, train_path, validation_path, all_path, c, validation_fraction)
test_annotations = read_annotations('data/GT-final_test.csv')
Prepare the data
Data preparation is very similar to the previous post. We will only add a vocab when creating the dataset. This is because, if we don't specify it, Fastai will take as vocab only the classes that appear in our data, which would be problematic if we restrict the problem to a subset of classes.
classes = pd.read_csv('data/signnames.csv')
class_names = {}
for i, row in classes.iterrows():
class_names[str(row[0])] = row[1]
sz = 96
data = ImageDataLoaders.from_folder(path, item_tfms=[Resize(sz,order=0)], bs=256,
batch_tfms=[*aug_transforms(do_flip=False,
max_zoom=1.2,
max_rotate=10,
max_lighting=0.8,
p_lighting=0.8),
Normalize.from_stats(*imagenet_stats)],
vocab=[str(i) for i in range(43)])
We can see that our vocab contains all the classes, no matter which classes we train on.
data.vocab
Knowledge Distillation
Knowledge Distillation is a technique proposed by Geoffrey E. Hinton as a way to extract the knowledge from big models and use it to train simpler architectures. Moreover, it has also proven to provide what seems to be more robust and general learning, which is always desirable. We have considered this approach because of the increasing size of models, not only in computer vision tasks but also in Natural Language Processing (NLP), which makes them difficult to use for inference in time-demanding tasks, or even impossible to fit in embedded systems.
The high-level idea is to train the models not with one-hot encoded vectors, but with a "softer version" of them that stores some information about the similarity between classes. It is a concept similar to word embeddings in NLP. These soft labels are obtained from a (normally) bigger model which is already trained and used only for inference.
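To make the idea concrete, here is a minimal PyTorch sketch of the classic temperature-scaled distillation loss from Hinton's paper. This is only an illustration of the concept, not FasterAI's implementation (we will use the library's SoftTarget loss later):
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=20, alpha=0.9):
    # soften both distributions with temperature T and match them with a KL divergence
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    # keep a small contribution of the usual hard-label cross-entropy
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard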
Here is where we start using the FasterAI library. Nevertheless, I ran into problems loading a model into a Learner with Fastai. As the KnowledgeDistillation callback expects to receive a Learner as the teacher, I had to copy and paste this callback and modify it by adding a from_learner parameter. This way I can set it to False and use a plain PyTorch model as the teacher. In any case, the usual way of using the library is as straightforward as installing it:
pip install fasterai
And import the things you need:
from fasterai import ...
sz = 96
bs = 256
wd = 5e-3
f1_score_mult = FBeta(beta=1, average='weighted')
To act as the teacher we will use the model we trained in the previous post, which can be found in this repo. You can download it from there or train your own model using the previous notebook. Once you have the trained model you will have to set the path_to_model variable.
If you are working in Colab you can drag and drop the model into Colab's file explorer, or save the model in your Drive and mount it (just by clicking the Mount Drive button in the Files tab, on the left in Colab). By mounting the drive you will be able to reach the model file by navigating through your Drive directories.
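If you prefer to mount your Drive from code instead of the button, the standard Colab snippet below does the same thing (where exactly you keep the model inside your Drive is up to you):
from google.colab import drive
drive.mount('/content/drive')  # the model would then live somewhere under /content/drive/MyDrive/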
path_to_model = '' # your_path
First, we prepare the teacher model. To do so, we will get the architecture of our model with random weights and then fill it with the trained weights.
We will use the create_cnn_model function of Fastai which, given a model architecture, creates a model (a Sequential object) with the head and the body that we choose. In the previous notebook we created the Learner with the default settings, so now we will use those defaults too. For each architecture Fastai has a way of cutting the model, and for all of them it uses the same default head; you can check the source code!
If you wanted to create the model with a different head and body you would have to use the cut and custom_head arguments. You can find more information about them in the documentation.
Calling .to(device) will place the model on the GPU or the CPU depending on the hardware available.
teacher = create_cnn_model(resnet34, 43, pretrained=False).to(device) #get the model architecture
teacher.load_state_dict(torch.load(path_to_model+'resnet34_weights.pth')) #load the trained weights
teacher
teacher is a Sequential object with two modules. Module (0) is the body, the original architecture of the model (in our case a resnet34), which can be pretrained (not in our case, since we load our own weights). The second module, (1): Sequential, is the head, which by default is initialized with kaiming_normal initialization (this can be changed using the init argument), so it is not pretrained. The layers of this head are the Fastai defaults. The decision about where to cut the original model is taken according to Fastai's metadata for each architecture, but as we said it is customizable using the cut argument (a hedged sketch of this is shown below).
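For completeness, here is a hedged sketch of what a custom cut point and head could look like. The layer sizes are illustrative only and are not the ones used in this notebook:
# Hypothetical custom head: a resnet18 body cut at -2 outputs 512 channels,
# and AdaptiveConcatPool2d doubles that to 1024 flattened features
my_head = nn.Sequential(AdaptiveConcatPool2d(), Flatten(),
                        nn.Linear(1024, 256), nn.ReLU(inplace=True),
                        nn.Linear(256, 43))
custom_model = create_cnn_model(resnet18, 43, cut=-2, custom_head=my_head)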
Our teacher model is almost ready! The only thing left is that, as we don't want to propagate gradients through the teacher (we only want it for inference), we should make sure it is frozen.
The following PyTorch code does exactly that.
for param in teacher.parameters():
param.requires_grad = False
Actually, inside the KnowledgeDistillation callback the teacher is only used for inference, in eval() mode, so its weights would not be updated even if the model were not frozen, but it is better to make sure.
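If you want to be extra explicit you can also switch the teacher to evaluation mode yourself; this only changes how layers like Dropout and BatchNorm behave at inference time:
teacher.eval()  # use running BatchNorm statistics and disable dropout during inference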
Now let's prepare our student network. As before we use create_cnn_model, but with a pretrained resnet18 architecture, to build the model that we will pass to the Learner constructor.
student_model=create_cnn_model(resnet18, 43).to(device) #get the model architecture and pretrained weights
Then, we could do the following:
bad_student = Learner(data, student_model, metrics=[accuracy,f1_score_mult])
bad_student.summary()
But this would leave the whole student trainable! We don't want that; we would like the earlier pretrained layers of the model (the body) to be frozen and the rest (the head) to be trainable. The reason is that, by default, Learner does not freeze any parameter group.
So let's freeze the body. To do that we use the freeze_to(n) method, where n is the number of parameter groups we want to freeze. We take n=1 as we only want the body to be frozen.
bad_student.freeze_to(1)
bad_student.summary()
Now we have frozen the entire model! 🥶
So, how are those parameter groups chosen? Through the splitter argument, which by default is trainable_params. This returns all the trainable parameters of the model as a single group. In our case all the parameters are trainable, so we only get one parameter group, and hence the whole student gets frozen.
The solution is to pass splitter=default_split. This splits the parameters into two groups, body and head, just as we want!
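A quick hedged sanity check, assuming the student is the usual body+head Sequential produced by create_cnn_model:
groups = default_split(student_model)
print(len(groups))  # expected: 2 parameter groups, one for the body and one for the head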
One last detail concerns lr_find(), which does not accept any weight decay or callback arguments (it takes the ones defined when creating the Learner). So the correct thing to do is to pass the KnowledgeDistillation callback as an argument to the Learner, as we do with the weight decay. By doing so we have the same training scenario when running lr_find() and fit_one_cycle().
loss = partial(SoftTarget, T=20)
kd = KnowledgeDistillation(teacher, loss, from_learner=False)
student = Learner(data, student_model, metrics=[accuracy,f1_score_mult], splitter=default_split, wd=wd, cbs=kd)
student.freeze_to(1)
#student.summary()
If you run student.summary() now it will raise the following error: Exception occured in KnowledgeDistillation when calling event after_loss: cross_entropy_loss(): argument 'target' (position 2) must be Tensor, not NoneType.
However, this only affects the summary method, so we can continue. If we really want to check the learner we can remove the callback and add it back, as the code below shows.
print('See in which position the KnowledgeDistillation is:')
print(student.cbs)
print()
student.remove_cb(cb=student.cbs[3]) # We could also do cb=kd
print(student.summary()) # Obviously you won't see the KnowledgeDistillation callback at the end
student.add_cb(cb=kd) # Add back the callback
print()
print('Check that you have the same list as before the summary:')
print(student.cbs)
Training
The training process is where the FasterAI magic comes in!
Just by adding the KnowledgeDistillation callback when fitting the model, we are able to use this technique with almost the same code as in the previous post. And since we already set those parameters when creating the Learner(), we don't even need to do that!
We will follow the same procedure as in the previous post to find the hyperparameters that train our model best. Remember that the weight decay and the callback are already included in the Learner() definition.
student.lr_find() # wd and callback included
student.fit_one_cycle(1, lr_max=0.001) # wd and cbs are set in Learner()
student.unfreeze()
student.lr_find()
student.fit_one_cycle(9, lr_max=slice(0.0001, 0.001)) #, wd=wd, cbs=kd)
student.lr_find()
student.fit_one_cycle(6, lr_max=slice(0.00001, 0.0001))#, wd=wd, cbs=kd)
Once we know which configuration works best we can train the model on all the available data (training + validation).
data = ImageDataLoaders.from_folder(path, item_tfms=[Resize(sz,order=0)], bs=256,
batch_tfms=[*aug_transforms(do_flip=False,
max_zoom=1.2,
max_rotate=10,
max_lighting=0.8,
p_lighting=0.8),
Normalize.from_stats(*imagenet_stats)],
vocab=[str(i) for i in range(43)], train='all')
teacher = create_cnn_model(resnet34, 43, pretrained=False).to(device)
teacher.load_state_dict(torch.load(path_to_model+'resnet34_weights.pth'))
student_model=create_cnn_model(resnet18, 43).to(device)
student = Learner(data, student_model, metrics=[accuracy,f1_score_mult], splitter=default_split, wd=wd, cbs=kd)
student.freeze_to(1)
student.fit_one_cycle(1, lr_max=0.001)
student.unfreeze()
student.fit_one_cycle(10, lr_max=slice(0.0001, 0.001))
student.fit_one_cycle(6, lr_max=0.0001)
Test
As we did in the previous post, we will use Test Time Augmentation, and we will use the same function to plot how the performance and the inference time behave.
We have added a mask argument to the test_time_aug function. This allows us to compute the F1-score only for some classes, for example only the ones we have trained on.
def test_time_aug(learner, test_dataloader, y_true, metric, n_augs=[10], beta=0.1,mask=None):
res = []
if mask is None:
mask=list(range(len(y_true)))
learner.eval()
for aug in n_augs:
if aug == 0:
start = time.time()
log_preds,_ = learner.get_preds(dl=test_dataloader)
end = time.time()
infer_time = end-start
else:
start = time.time()
log_preds,_ = learner.tta(dl=test_dataloader, n=aug, beta=beta)
end = time.time()
infer_time = end-start
preds = np.argmax(log_preds,1)
score = metric(preds[mask], y_true[mask])
res.append([aug, score,infer_time])
print(f'N Augmentations: {aug}\tF1-score: {score}\tTime:{infer_time}')
return pd.DataFrame(res, columns=['n_aug', 'score', 'time'])
true_test_labels = {a.filename: a.label for a in test_annotations}
class_indexes = data.vocab.o2i
test_img=get_image_files('./data/test')
filenames = [filepath.name for filepath in test_img]
labels = [str(true_test_labels[filename]) for filename in filenames]
y_true = np.array([class_indexes[label] for label in labels])
test_dataloader=data.test_dl(test_img, bs=256, shuffle=False)
The following code will create the mask used for computing the F1-score. You only have to specify the classes you want to focus on in the interest_classes variable. We will focus on all the classes we trained with (which here is all of them).
interest_classes=c
interest_idx=[data.vocab.o2i[str(cl)] for cl in interest_classes]
mask=np.isin(y_true, interest_idx, invert=False)
If you set invert to True you will get the F1-score only on the images of classes you have not seen during training.
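For instance, if you had trained on a subset of classes only (say c = list(range(30))), a hedged sketch of evaluating just the unseen classes would be:
# Hypothetical: only meaningful when some classes were left out of training,
# otherwise this mask selects no images at all
unseen_mask = np.isin(y_true, interest_idx, invert=True)
# results_unseen = test_time_aug(student, test_dataloader, y_true, metric, n_augs=[0], mask=unseen_mask)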
Remember that our Learner still has the KnowledgeDistillation callback. But we don't need it anymore! (sorry Nathan, you've been a hero). Let's remove it.
student.cbs
student.cbs[3]
student.remove_cb(cb=student.cbs[3])
student.cbs
metric = partial(f1_score, average='weighted')
results = test_time_aug(student, test_dataloader, y_true, metric, n_augs=[0, 5, 10, 20, 30])
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(results['n_aug'], results['score'], 'g-')
ax2.plot(results['n_aug'], results['time'], 'b--')
ax1.set_xlabel('Number Augmentations')
ax1.set_ylabel('F1 score', color='g')
ax2.set_ylabel('Time (s)', color='b')
plt.show()
Conclusion
As we have seen, Knowledge Distillation allows us to train for more epochs without overfitting and improves the final performance of the model compared with not using it (getting very close to the resnet34, which reached a 0.9963 F1-score). An intuitive explanation is that we give the network more information at every backpropagation step: not only the one-hot labels, but also the probability of the image belonging to each of the other classes. With this we introduce the idea of similarity between classes.
I find these types of compression approaches very interesting and I would like to apply them in other domains, maybe NLP or 3D data. So I cannot recommend the FasterAI library enough, and I am infinitely thankful to Nathan Hubens for his work. Also, if you want to learn more about how to make smaller and faster neural networks I encourage you to visit his blog.
Thanks for reading!