Transfer Learning
| Article | |
|---|---|
| Topic area | Machine Learning |
| Difficulty | Intermediate |
| Prerequisites | Neural Networks |
Transfer learning is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.
Motivation
Training deep neural networks from scratch typically requires large datasets and significant computational resources. In many practical domains — medical imaging, legal text analysis, low-resource languages — labelled data is scarce. Transfer learning addresses this mismatch: a model pretrained on a data-rich source task captures general features (edges, textures, syntactic patterns) that transfer well to a data-scarce target task.
Key Concepts
Domain and Task
Formally, a domain $ \mathcal{D} = \{\mathcal{X}, P(X)\} $ consists of a feature space $ \mathcal{X} $ and a marginal distribution $ P(X) $. A task $ \mathcal{T} = \{\mathcal{Y}, f(\cdot)\} $ consists of a label space $ \mathcal{Y} $ and a predictive function $ f $. Transfer learning applies when the source and target differ in domain, task, or both.
Domain Adaptation
When the source and target share the same task but differ in data distribution ($ P_s(X) \neq P_t(X) $), the problem is called domain adaptation. Techniques include:
- Instance reweighting — adjusting sample weights so the source distribution approximates the target.
- Feature alignment — learning domain-invariant representations (e.g., via adversarial training or maximum mean discrepancy).
- Self-training — using model predictions on unlabelled target data as pseudo-labels.
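Instance reweighting can be made concrete with a toy sketch. The function below estimates per-sample importance weights $ w(x) \approx P_t(x)/P_s(x) $ by comparing binned empirical frequencies of a one-dimensional feature; the function name and the histogram-based density estimate are illustrative choices for this article, not from any particular library (real systems use density-ratio estimation or classifier-based methods).

```python
from collections import Counter

def importance_weights(source, target, round_to=1):
    """Toy instance reweighting: estimate w(x) ~ P_t(x) / P_s(x) by
    binning a 1-D feature and comparing empirical bin frequencies.
    Source samples that look like the target get upweighted."""
    bin_s = Counter(round(x, round_to) for x in source)
    bin_t = Counter(round(x, round_to) for x in target)
    n_s, n_t = len(source), len(target)
    weights = []
    for x in source:
        b = round(x, round_to)
        p_s = bin_s[b] / n_s               # empirical source density
        p_t = bin_t.get(b, 0) / n_t        # empirical target density
        weights.append(p_t / p_s)
    return weights
```

Source samples that fall in regions the target never visits receive weight 0, while samples common in the target but rare in the source receive weights above 1, so a weighted training loss approximates training on the target distribution.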
Fine-Tuning vs Feature Extraction
| Strategy | Description | When to use |
|---|---|---|
| Feature extraction | Freeze all pretrained layers; train only a new output head | Very small target dataset; source and target are closely related |
| Fine-tuning (full) | Unfreeze all layers and train end-to-end with a small learning rate | Moderate target dataset; source and target differ meaningfully |
| Gradual unfreezing | Progressively unfreeze layers from top to bottom over training | Balances stability of lower features with adaptation of higher ones |
A common heuristic is to use a learning rate 10–100x smaller for pretrained layers than for the new classification head, which helps prevent catastrophic forgetting of the pretrained representations.
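The two-group heuristic above can be sketched as a small helper that maps parameter names to learning rates; the function name, layer names, and default factors are illustrative assumptions, though the pattern mirrors the per-parameter-group options that frameworks such as PyTorch expose.

```python
def param_group_lrs(backbone_names, head_names, base_lr=1e-3, backbone_factor=0.01):
    """Assign the new head the base learning rate and every pretrained
    backbone layer a rate 100x smaller (backbone_factor=0.01), per the
    common fine-tuning heuristic. Names and factors are illustrative."""
    groups = {}
    for name in backbone_names:
        groups[name] = base_lr * backbone_factor   # small LR: preserve features
    for name in head_names:
        groups[name] = base_lr                     # full LR: head trains from scratch
    return groups
```

For feature extraction, the same idea degenerates to setting the backbone factor to zero (i.e., freezing those layers entirely).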
Pretrained Models
Computer Vision
ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. Fine-tuning an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.
Natural Language Processing
Language model pretraining transformed NLP. Key milestones include:
- Word2Vec / GloVe — static word embeddings pretrained on large corpora.
- ELMo — contextualised embeddings from bidirectional LSTMs.
- BERT (Devlin et al., 2019) — bidirectional Transformer pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
- GPT series — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.
When to Use Transfer Learning
Transfer learning is most beneficial when:
- The target dataset is small relative to the model's capacity.
- The source and target domains share structural similarities (e.g., both involve natural images or natural language).
- Computational resources for full pretraining are unavailable.
- Rapid prototyping is needed before committing to large-scale data collection.
It may hurt performance (negative transfer) when the source and target domains are fundamentally dissimilar — for instance, transferring from natural images to spectrograms without appropriate adaptation.
Practical Tips
- Data augmentation complements transfer learning by artificially expanding the effective size of the target dataset.
- Learning rate warmup helps stabilise early training when fine-tuning large pretrained models.
- Early stopping on a validation set prevents overfitting during fine-tuning, especially with small datasets.
- Layer-wise learning rate decay assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
- Intermediate task transfer — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.
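Layer-wise learning rate decay, mentioned above, reduces each layer's rate geometrically with distance from the head. A minimal sketch, with illustrative default values (the function name and parameters are our own, not from a specific library):

```python
def layerwise_lr(num_layers, head_lr=1e-3, decay=0.9):
    """Layer-wise learning-rate decay: the layer nearest the head gets
    head_lr, and each earlier layer gets its successor's rate times
    `decay`, so the most general early layers change the least."""
    # index 0 = earliest (most general) layer; index num_layers-1 = nearest the head
    return [head_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]
```

With `decay=0.9` over a 12-layer Transformer, the first layer trains at roughly 30% of the head's rate, giving a smooth interpolation between feature extraction and full fine-tuning.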
Evaluation
Transfer learning effectiveness is typically measured as the accuracy gain over an identical model trained from scratch on the target task:
- $ \Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}} $
A positive $ \Delta_{\mathrm{transfer}} $ indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.
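Both evaluation quantities are straightforward to compute; the helper names below are illustrative:

```python
def transfer_gain(acc_transfer, acc_scratch):
    """Delta_transfer: accuracy of the fine-tuned model minus that of an
    identical model trained from scratch; positive means successful transfer."""
    return acc_transfer - acc_scratch

def epochs_to_target(acc_history, target):
    """First epoch (1-indexed) at which validation accuracy reaches `target`,
    or None if it never does; used to compare convergence speed."""
    for epoch, acc in enumerate(acc_history, start=1):
        if acc >= target:
            return epoch
    return None
```

Comparing `epochs_to_target` for the transferred and from-scratch runs captures the convergence-speed advantage even when final accuracies end up similar.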
References
- Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". IEEE Transactions on Knowledge and Data Engineering.
- Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". NeurIPS.
- Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
- Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ACL.
- Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". Proceedings of the IEEE.