Transfer Learning

Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.

What is Transfer Learning?

In practical situations, we will mostly use a pre-trained model.

“In practice, very few people train an entire Convolutional Network from scratch (with random initialization), because it is relatively rare to have a data set of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest.”

The hard work of optimizing the parameters has already been done for you; all you have to do is fine-tune the model and tweak the hyperparameters. In that sense, a pre-trained model is a life-saver.

Transfer Learning Scenarios

Note: It’s common to use a smaller learning rate for ConvNet weights when doing Transfer Learning.

The three major Transfer Learning scenarios look as follows:

Feature Extractor

Use the representations learned by a previous network to extract meaningful features from new samples. You simply add a new classifier, which will be trained from scratch, on top of the pretrained model so that you can repurpose the feature maps learned previously for the dataset.

You do not need to (re)train the entire model. The base convolutional network already contains features that are generically useful for classifying pictures. However, the final, classification part of the pretrained model is specific to the original classification task, and subsequently specific to the set of classes on which the model was trained.

  • Take a ConvNet pretrained on ImageNet and remove the last fully-connected layer
  • Then treat the rest of the ConvNet as a fixed feature extractor for the new dataset
  • Train a linear classifier (e.g. a linear SVM or softmax classifier) for the new dataset; a minimal sketch follows this list
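
A minimal Keras sketch of this workflow (assuming tf.keras; num_classes is a placeholder for the number of classes in the new dataset):

import tensorflow as tf

num_classes = 10  # placeholder: number of classes in the new dataset

# Pretrained convolutional base without the original ImageNet classifier head.
base = tf.keras.applications.VGG16(include_top=False, weights='imagenet',
                                   input_shape=(224, 224, 3), pooling='avg')
base.trainable = False  # fixed feature extractor: its weights are never updated

# New softmax classifier, trained from scratch on top of the frozen base.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])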

Fine-Tuning

Unfreeze a few of the top layers of a frozen model base and jointly train both the newly-added classifier layers and the last layers of the base model. This allows us to “fine-tune” the higher-order feature representations in the base model in order to make them more relevant for the specific task.

  • It is possible to fine-tune all the layers of the ConvNet, or to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network; a sketch follows this list
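
A minimal sketch of this idea, continuing from the feature-extractor sketch above (the choice of block5 as the portion to unfreeze is just an illustration):

# Unfreeze only the last convolutional block of VGG16; earlier layers stay fixed.
base.trainable = True
for layer in base.layers:
    layer.trainable = layer.name.startswith('block5')

# Recompile with a smaller learning rate so the pretrained weights are not
# distorted too quickly while the new classifier is still learning.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])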

Pretrained Models

  • Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release their final ConvNet checkpoints for the benefit of others who can use the networks for fine-tuning. For example, the Caffe library has a Model Zoo where people share their network weights.
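
In tf.keras, loading such a released checkpoint is usually a one-liner; a minimal sketch (ResNet50 is just an example choice of architecture):

import tensorflow as tf

# Downloads the pretrained ImageNet weights on first use (requires internet).
resnet = tf.keras.applications.ResNet50(weights='imagenet')
resnet.summary()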

How to Decide the Type of Transfer Learning

How do you decide what type of transfer learning you should perform on a new dataset? This is a function of several factors, but the two most important ones are the size of the new dataset (small or big), and its similarity to the original dataset (e.g. ImageNet-like in terms of the content of images and the classes, or very different, such as microscope images).

Here are the factors you need to be aware of:

  • The size of the new dataset you want to train on
  • Whether your new dataset is similar to the original one or not

New dataset is small and similar to original dataset

  • Train a linear classifier on the CNN codes (Feature Extractor)

New dataset is large and similar to the original dataset

  • Fine-tune through the full network (Fine-Tuning)
    • We can have more confidence that we won’t overfit

New dataset is small but different from the original dataset

  • Train a linear classifier on the CNN codes (Feature Extractor)
  • Or train an SVM classifier on activations from somewhere earlier in the network, since features near the top are more specific to the original dataset; a sketch follows this list
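
A rough sketch of the second option, assuming tf.keras for feature extraction and scikit-learn's LinearSVC as the classifier (x_train and y_train are placeholders for the new dataset's preprocessed images and labels):

import tensorflow as tf
from sklearn.svm import LinearSVC

# Full pretrained VGG16; we tap an earlier layer instead of the top-level CNN codes,
# since features near the top are more specific to the original ImageNet classes.
vgg = tf.keras.applications.VGG16(weights='imagenet')
feature_model = tf.keras.Model(inputs=vgg.input,
                               outputs=vgg.get_layer('block3_pool').output)

# x_train: preprocessed images of shape (N, 224, 224, 3); y_train: labels of shape (N,)
features = feature_model.predict(x_train)
features = features.reshape(len(features), -1)  # flatten spatial dimensions

svm = LinearSVC()
svm.fit(features, y_train)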

New dataset is large but different from the original dataset

  • Since the dataset is large, we may expect that we can afford to train a ConvNet from scratch
  • In practice, however, it is often still beneficial to initialize with pretrained weights; we would have enough data and confidence to fine-tune through the entire network

Code Example - Feature Extractor

In this example, we will show how to use a pretrained VGG16 model as a feature extractor, without retraining its convolutional base.

Load the VGG16 Model and Store it into a new model

We need to download the VGG16 model (this requires an internet connection) and then store its layers in a new model.

Note that since we are doing feature extraction, we won’t need the last softmax layer, as we don’t have 1000 classes to classify.

Import TensorFlow

import tensorflow as tf

Load the VGG16 model

vgg16_model = tf.keras.applications.vgg16.VGG16()

vgg16_model.summary()

type(vgg16_model)
# tensorflow.python.keras.engine.training.Model

It is of type Model, not of type Sequential.

Therefore we need to copy its layers into a Sequential object.

Store All the Layers EXCEPT Softmax Layer (Last FC layer)

from tensorflow.keras.models import Sequential

model = Sequential()
for layer in vgg16_model.layers[:-1]:
    model.add(layer)

model.summary()

You can also choose to load the full model and then use model.layers.pop() to remove the last FC layer.
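
Note that in newer versions of tf.keras, model.layers.pop() may not actually modify the underlying graph; a safer alternative (a sketch, not part of the original walkthrough) is to build a new functional Model that stops at the second-to-last layer:

# Equivalent feature extractor built with the functional API:
feature_extractor = tf.keras.Model(inputs=vgg16_model.input,
                                   outputs=vgg16_model.layers[-2].output)
feature_extractor.summary()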

Freeze the weights and biases of the model

These are also called trainable parameters.

Freeze the trainable parameters

We don’t want to mess with the trained VGG16 model.

Its weights will never be updated as long as layer.trainable = False.

for layer in model.layers:
    layer.trainable = False

Modify last FC layer

In this example, we want to use the model to classify cats and dogs (2 classes).

Therefore we will add a Dense layer with only 2 nodes and apply a softmax activation.

  • Keep an eye on the trainable and non-trainable parameter counts in model.summary().

Add Softmax Classifier

from tensorflow.keras.layers import Dense

model.add(Dense(units=2, activation='softmax'))

model.summary()

Train the fine-tuned VGG16 model

Setting Optimizer

from tensorflow.keras.optimizers import Adam

# categorical_crossentropy because one-hot encoding is applied already
model.compile(Adam(learning_rate=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])

It’s common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the (randomly-initialized) weights for the new linear classifier that computes the class scores of your new dataset. This is because we expect that the ConvNet weights are relatively good, so we don’t wish to distort them too quickly and too much (especially while the new Linear Classifier above them is being trained from random initialization).

Start Training

The training code is exactly the same code we would use to train any other model.

model.fit_generator(train_batches,
                    steps_per_epoch=4,
                    validation_data=valid_batches,
                    validation_steps=4,
                    epochs=5,
                    verbose=2)

  • .fit is used when the entire training dataset fits into memory and no data augmentation is applied.
  • .fit_generator is used when the dataset is too large to fit into memory or when data augmentation needs to be applied; a sketch of how train_batches and valid_batches might be created follows this list.
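
train_batches and valid_batches above are Keras data generators. A minimal sketch of how they might be created (the directory paths, class names and batch size are assumptions for illustration):

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.vgg16 import preprocess_input

# Assumed layout: data/train and data/valid each contain a cat/ and a dog/ subfolder.
datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

train_batches = datagen.flow_from_directory('data/train', target_size=(224, 224),
                                            classes=['cat', 'dog'], batch_size=10)
valid_batches = datagen.flow_from_directory('data/valid', target_size=(224, 224),
                                            classes=['cat', 'dog'], batch_size=10)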

Reference

CS231n - Transfer Learning

TensorFlow Core - Transfer learning and fine-tuning