Daily Dose of Data Science – Day 8 – Scenarios in which Transfer Learning may not work

In a previous edition of Daily Dose of Data Science, I talked about how TensorFlow Hub can accelerate modeling on unstructured data using the concept of transfer learning. But there are many scenarios in which transfer learning may not be the most effective option. In this post, I will discuss the specific situations in which transfer learning should be avoided, or in which its failure can be justified.

The entire concept of transfer learning is inspired by the way humans think and acquire knowledge: we adapt previously acquired knowledge from other domains or tasks to solve new problems we have just been exposed to. For example, if you already know how to ride a bicycle, it might take you only a couple of days to learn how to ride a motorbike. But if you don’t know how to ride a bicycle, it will take you more time to learn how to balance a two-wheeled vehicle, and much more time after that to ride a motorbike! The idea is very similar in the world of deep learning, and it is especially useful when we have a very limited dataset and limited computational resources to train our models. In this post I will not discuss how transfer learning works; rather, I will discuss certain specific scenarios in which transfer learning might fail horribly and, instead of utilizing cross-domain knowledge, create more domain confusion.

Scenarios in which Transfer Learning might fail!

#1 – The task the pretrained model was trained on is completely different from the task at hand

There is plenty of recent research showing that if the task a pretrained model was trained on differs from the task the current model is designed for, random weight initialization can be better than using the pretrained weights, indicating the failure of transfer learning in such cases. For example, if you want to use models pretrained on the ImageNet dataset (which is primarily designed for classification problems) to solve an object detection problem, transfer learning can fail miserably. In their work Rethinking Pre-training and Self-training, Zoph et al. conducted thorough experiments using ImageNet pre-training (a classification task) for object detection on the COCO dataset and showed that when the type of task changes, using pretrained weights is not effective, and in fact random weight initialization can work better. Although the authors proposed self-training, which works even with strong data augmentation, the key conclusion was that pre-training was not effective, and thus transfer learning will not be useful in such cases.
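To make the comparison concrete, here is a minimal Keras sketch (an illustration under my own assumptions, not the exact setup used by Zoph et al.): the only difference between the two regimes is whether the backbone starts from ImageNet weights or from random weights, and you can train both on the new task and compare.

```python
import tensorflow as tf

# Hypothetical example: the same backbone initialised two different ways.
# Assumes an image task with 224x224 RGB inputs and NUM_CLASSES outputs.
NUM_CLASSES = 10

def build_model(weights):
    # weights="imagenet" loads ImageNet-pretrained weights (transfer learning),
    # weights=None gives random initialisation (training from scratch).
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=weights,
        input_shape=(224, 224, 3), pooling="avg",
    )
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

pretrained_model = build_model("imagenet")  # transfer learning
scratch_model = build_model(None)           # random initialisation
# Train both with model.fit(...) on the new task and compare validation accuracy.
```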

#2 – There is data drift between the pretraining data and the dataset used for the current task

Considering the principles of Data-Centric AI, which Andrew Ng, one of the influential thought leaders in the field of AI, has been advocating for, data drift is a severe problem that causes many machine learning models to fail when moved to production. It means that some key statistical properties of the data change between training time and inference time, so the trained model is highly likely to generalize poorly and show weak performance when deployed. This usually happens when the source of the data flowing into the trained ML model changes and differs from the data used during training and hyper-parameter tuning. In practice, for images this can be a change in camera resolution, background light exposure, or random jitter; for audio data it can be a change in background noise level or the use of a different microphone to capture the recordings. Likewise, any difference in the data ingestion process can introduce data drift.

The following are some statistical approaches to detect data drift (a minimal code sketch follows the list):

  1. Population Stability Index – PSI compares the distribution of a scoring variable (e.g. the predicted probability) in the scoring dataset with its distribution in the training dataset that was used to develop the model.
  2. Kullback-Leibler divergence – KL divergence measures how much the distribution of the training data differs from that of the inference data.
  3. Jensen-Shannon divergence – JS divergence measures the similarity between two probability distributions. It is based on the KL divergence, with some notable differences: it is symmetric and always has a finite value.
  4. Kolmogorov-Smirnov test – A nonparametric test of the equality of continuous (or discontinuous) 1-D probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test).
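Here is a minimal sketch of these four checks using NumPy and SciPy; the train_scores and live_scores arrays are simulated stand-ins for a scoring variable collected at training time and at inference time.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for a 1-D scoring variable (e.g. predicted probability)
# observed during training and during inference.
rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 5000)
live_scores = rng.normal(0.3, 1.2, 5000)   # deliberately shifted distribution

# Bin both samples on a common grid to obtain discrete distributions.
bins = np.histogram_bin_edges(np.concatenate([train_scores, live_scores]), bins=10)
p = np.histogram(train_scores, bins=bins)[0].astype(float) + 1e-6  # smoothing
q = np.histogram(live_scores, bins=bins)[0].astype(float) + 1e-6
p /= p.sum()
q /= q.sum()

psi = np.sum((q - p) * np.log(q / p))                       # 1. Population Stability Index
kl = stats.entropy(p, q)                                    # 2. KL divergence D(train || live)
m = (p + q) / 2
js = 0.5 * stats.entropy(p, m) + 0.5 * stats.entropy(q, m)  # 3. Jensen-Shannon divergence
ks = stats.ks_2samp(train_scores, live_scores)              # 4. two-sample K-S test

print(f"PSI={psi:.3f}  KL={kl:.3f}  JS={js:.3f}  "
      f"KS={ks.statistic:.3f} (p-value={ks.pvalue:.2e})")
```

Large values of PSI, KL or JS (or a tiny K-S p-value) on real data would suggest that the inference distribution has drifted away from the training distribution.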

#3 – There is a significant difference in the regions of interest between the pretraining data and the dataset used for the current task

Sometimes, if there is a difference in key features between the dataset a pretrained model was trained on and the dataset used for the current task, transfer learning can be ineffective. For example, models trained on ImageNet learn to pick up a combination of granular features like edges, stripes, curves and line segments, as well as higher-level features like faces and the shapes of objects and animals. But suppose the current task is defect detection on mechanical parts, where each part can show different types of defects like scratches, holes, spots and smudges. The main regions of interest in these two datasets are completely different. Moreover, fine-tuning during transfer learning is usually applied only to the final layers, which learn the higher-level features. In this example, after transfer learning the model might be able to tell different mechanical parts like gears, nuts and bolts apart, but it will not be able to distinguish between the different types of defects within the parts.
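One hedged workaround, sketched below with Keras (the layer count and class names are purely illustrative assumptions): instead of fine-tuning only the classification head, unfreeze a larger part of the backbone so the lower-level filters also get a chance to adapt to defect-like patterns.

```python
import tensorflow as tf

# Illustrative sketch: unfreeze part of the backbone, not just the head,
# when the target regions of interest differ strongly from ImageNet objects.
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg",
)
base.trainable = True
for layer in base.layers[:-30]:     # keep all but the last ~30 layers frozen
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(4, activation="softmax"),  # e.g. scratch/hole/spot/smudge
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),   # small LR for fine-tuning
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

How deep to unfreeze is a judgment call; a small learning rate is typically used so the pretrained weights are adjusted gently rather than overwritten.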

#4 – Use of strong data augmentation that may introduce data drift

Sometimes aggressive data augmentation, such as adversarial images, abrupt random noise, heavy color saturation or other custom image processing techniques, can cause severe drift between the dataset used during model pretraining and the dataset for the task at hand. This can also be one of the reasons why transfer learning does not perform as expected. It is therefore recommended to apply the same data pre-processing, rescaling and normalization methods that were used to train the pretrained model when fine-tuning it, so that transfer learning can work efficiently.
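As a small sketch of that recommendation (assuming a TensorFlow pipeline; the resize target and augmentation strengths are illustrative), reuse the exact preprocessing function the pretrained backbone expects and keep augmentation mild:

```python
import tensorflow as tf

# Reuse the pretrained backbone's own preprocessing instead of ad-hoc rescaling.
preprocess = tf.keras.applications.resnet50.preprocess_input

# Keep augmentation mild so the fine-tuning data stays close to the
# distribution the backbone was pretrained on.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
])

def prepare(image, label, training=False):
    image = tf.image.resize(image, (224, 224))
    if training:
        image = augment(image, training=True)
    return preprocess(image), label

# Example usage with a tf.data pipeline:
# train_ds = train_ds.map(lambda x, y: prepare(x, y, training=True))
# val_ds = val_ds.map(prepare)
```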

So, the next time you see poor results after applying transfer learning, the points above may help you investigate the root cause of the failure! They are all based on my experience from the field, and there may be several other reasons why transfer learning does not give you the level of model efficiency you want. With that, I am going to draw the curtains on today’s edition of Daily Dose of Data Science. Stay tuned for the next one, and please feel free to like, share, comment and subscribe if you find these posts helpful!
