Transfer Learning
| Article | |
|---|---|
| Topic area | Machine Learning |
| Difficulty | Intermediate |
| Prerequisites | Neural Networks |
Transfer learning is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.
Motivation
Training deep neural networks from scratch typically requires large datasets and significant computational resources. In many practical domains — medical imaging, legal text analysis, low-resource languages — labelled data is scarce. Transfer learning addresses this mismatch: a model pretrained on a data-rich source task captures general features (edges, textures, syntactic patterns) that transfer well to a data-scarce target task.
Key Concepts
Domain and Task
Formally, a domain $ \mathcal{D} = \{\mathcal{X}, P(X)\} $ consists of a feature space $ \mathcal{X} $ and a marginal distribution $ P(X) $. A task $ \mathcal{T} = \{\mathcal{Y}, f(\cdot)\} $ consists of a label space $ \mathcal{Y} $ and a predictive function $ f $. Transfer learning applies when the source and target differ in domain, task, or both.
Domain Adaptation
When the source and target share the same task but differ in data distribution ($ P_s(X) \neq P_t(X) $), the problem is called domain adaptation. Techniques include:
- Instance reweighting — adjusting sample weights so the source distribution approximates the target.
- Feature alignment — learning domain-invariant representations (e.g., via adversarial training or maximum mean discrepancy).
- Self-training — using model predictions on unlabelled target data as pseudo-labels.
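Instance reweighting can be made concrete with a toy sketch. The function below estimates per-sample importance weights $ w(x) \approx P_t(x)/P_s(x) $ by comparing binned empirical frequencies of a one-dimensional feature; the function name and the histogram-based density estimate are illustrative choices for this article, not from any particular library (real systems use density-ratio estimation or classifier-based methods).

```python
from collections import Counter

def importance_weights(source, target, round_to=1):
    """Toy instance reweighting: estimate w(x) ~ P_t(x) / P_s(x) by
    binning a 1-D feature and comparing empirical bin frequencies.
    Source samples that look like the target get upweighted."""
    bin_s = Counter(round(x, round_to) for x in source)
    bin_t = Counter(round(x, round_to) for x in target)
    n_s, n_t = len(source), len(target)
    weights = []
    for x in source:
        b = round(x, round_to)
        p_s = bin_s[b] / n_s               # empirical source density
        p_t = bin_t.get(b, 0) / n_t        # empirical target density
        weights.append(p_t / p_s)
    return weights
```

Source samples that fall in regions the target never visits receive weight 0, while samples common in the target but rare in the source receive weights above 1, so a weighted training loss approximates training on the target distribution.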
Fine-Tuning vs Feature Extraction
| Strategy | Description | When to use |
|---|---|---|
| Feature extraction | Freeze all pretrained layers; train only a new output head | Very small target dataset; source and target are closely related |
| Fine-tuning (full) | Unfreeze all layers and train end-to-end with a small learning rate | Moderate target dataset; source and target differ meaningfully |
| Gradual unfreezing | Progressively unfreeze layers from top to bottom over training | Balances stability of lower features with adaptation of higher ones |
A common heuristic is to use a learning rate 10–100x smaller for pretrained layers than for the new classification head, which helps prevent catastrophic forgetting of the pretrained representations.
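The two-group heuristic above can be sketched as a small helper that maps parameter names to learning rates; the function name, layer names, and default factors are illustrative assumptions, though the pattern mirrors the per-parameter-group options that frameworks such as PyTorch expose.

```python
def param_group_lrs(backbone_names, head_names, base_lr=1e-3, backbone_factor=0.01):
    """Assign the new head the base learning rate and every pretrained
    backbone layer a rate 100x smaller (backbone_factor=0.01), per the
    common fine-tuning heuristic. Names and factors are illustrative."""
    groups = {}
    for name in backbone_names:
        groups[name] = base_lr * backbone_factor   # small LR: preserve features
    for name in head_names:
        groups[name] = base_lr                     # full LR: head trains from scratch
    return groups
```

For feature extraction, the same idea degenerates to setting the backbone factor to zero (i.e., freezing those layers entirely).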
Pretrained Models
Computer Vision
ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. Fine-tuning an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.
Natural Language Processing
Language model pretraining transformed NLP. Key milestones include:
- Word2Vec / GloVe — static word embeddings pretrained on large corpora.
- ELMo — contextualised embeddings from bidirectional LSTMs.
- BERT (Devlin et al., 2019) — bidirectional Transformer pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
- GPT series — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.
When to Use Transfer Learning
Transfer learning is most beneficial when:
- The target dataset is small relative to the model's capacity.
- The source and target domains share structural similarities (e.g., both involve natural images or natural language).
- Computational resources for full pretraining are unavailable.
- Rapid prototyping is needed before committing to large-scale data collection.
It may hurt performance (negative transfer) when the source and target domains are fundamentally dissimilar — for instance, transferring from natural images to spectrograms without appropriate adaptation.
Practical Tips
- Data augmentation complements transfer learning by artificially expanding the effective size of the target dataset.
- Learning rate warmup helps stabilise early training when fine-tuning large pretrained models.
- Early stopping on a validation set prevents overfitting during fine-tuning, especially with small datasets.
- Layer-wise learning rate decay assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
- Intermediate task transfer — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.
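Layer-wise learning rate decay, mentioned above, reduces each layer's rate geometrically with distance from the head. A minimal sketch, with illustrative default values (the function name and parameters are our own, not from a specific library):

```python
def layerwise_lr(num_layers, head_lr=1e-3, decay=0.9):
    """Layer-wise learning-rate decay: the layer nearest the head gets
    head_lr, and each earlier layer gets its successor's rate times
    `decay`, so the most general early layers change the least."""
    # index 0 = earliest (most general) layer; index num_layers-1 = nearest the head
    return [head_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]
```

With `decay=0.9` over a 12-layer Transformer, the first layer trains at roughly 30% of the head's rate, giving a smooth interpolation between feature extraction and full fine-tuning.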
Evaluation
Transfer learning effectiveness is typically measured as the accuracy gain over an identical model trained from scratch on the target task:
- $ \Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}} $
A positive $ \Delta_{\mathrm{transfer}} $ indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.
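Both evaluation quantities are straightforward to compute; the helper names below are illustrative:

```python
def transfer_gain(acc_transfer, acc_scratch):
    """Delta_transfer: accuracy of the fine-tuned model minus that of an
    identical model trained from scratch; positive means successful transfer."""
    return acc_transfer - acc_scratch

def epochs_to_target(acc_history, target):
    """First epoch (1-indexed) at which validation accuracy reaches `target`,
    or None if it never does; used to compare convergence speed."""
    for epoch, acc in enumerate(acc_history, start=1):
        if acc >= target:
            return epoch
    return None
```

Comparing `epochs_to_target` for the transferred and from-scratch runs captures the convergence-speed advantage even when final accuracies end up similar.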
References
- Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". IEEE Transactions on Knowledge and Data Engineering.
- Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". NeurIPS.
- Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
- Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ACL.
- Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". Proceedings of the IEEE.