DeployBot: [deploy-bot] Deploy from CI (775ba6e)

2026-04-24T04:01:45Z

New page

{{LanguageBar | page = Transfer Learning}}
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]]}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Transfer learning''' is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By reusing knowledge acquired during large-scale pretraining, transfer learning can substantially reduce the amount of labelled data, compute, and training time required for downstream applications.

== Motivation ==

Training deep neural networks from scratch typically requires large datasets and significant computational resources. In many practical domains, such as medical imaging, legal text analysis, and low-resource languages, labelled data is scarce. Transfer learning addresses this gap between what training requires and what is available: a model pretrained on a data-rich source task captures general features (edges, textures, syntactic patterns) that often transfer well to a data-scarce target task.

== Key Concepts ==

=== Domain and Task ===

Formally, a '''domain''' <math>\mathcal{D} = \{\mathcal{X}, P(X)\}</math> consists of a feature space <math>\mathcal{X}</math> and a marginal distribution <math>P(X)</math>. A '''task''' <math>\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}</math> consists of a label space <math>\mathcal{Y}</math> and a predictive function <math>f</math>. Transfer learning applies when the source and target differ in domain, task, or both.

=== Domain Adaptation ===

When the source and target share the same task but differ in data distribution (<math>P_s(X) \neq P_t(X)</math>), the problem is called '''domain adaptation'''. Techniques include:

* '''Instance reweighting''' — adjusting sample weights so the source distribution approximates the target.
* '''Feature alignment''' — learning domain-invariant representations (e.g., via adversarial training or maximum mean discrepancy).
* '''Self-training''' — using model predictions on unlabelled target data as pseudo-labels.
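
As an illustration of instance reweighting, the following minimal sketch estimates the density ratio <math>w(x) = P_t(x) / P_s(x)</math> for a single discrete feature by counting, then uses the weights to re-estimate a source-trained model's error under the target distribution. The function names, the Laplace smoothing, and the toy "day"/"night" data are assumptions made for this example; real features are usually continuous and need a dedicated density-ratio estimator.

```python
from collections import Counter


def importance_weights(source_x, target_x, smoothing=1.0):
    """Estimate w(x) = P_t(x) / P_s(x) for each source example.

    Works for a single discrete feature. Additive smoothing keeps the
    ratio finite when a value is missing from one of the two samples.
    """
    values = set(source_x) | set(target_x)
    src_counts = Counter(source_x)
    tgt_counts = Counter(target_x)
    src_total = len(source_x) + smoothing * len(values)
    tgt_total = len(target_x) + smoothing * len(values)

    ratio = {}
    for value in values:
        p_source = (src_counts[value] + smoothing) / src_total
        p_target = (tgt_counts[value] + smoothing) / tgt_total
        ratio[value] = p_target / p_source
    return [ratio[x] for x in source_x]


def weighted_error(errors, weights):
    """Weighted mean of per-example 0/1 errors."""
    return sum(e * w for e, w in zip(errors, weights)) / sum(weights)


# Toy data: the source is mostly "day" images, the target mostly "night".
source_x = ["day"] * 8 + ["night"] * 2
target_x = ["day"] * 2 + ["night"] * 8
weights = importance_weights(source_x, target_x)

# Suppose the source model is wrong on exactly the two "night" examples.
errors = [0] * 8 + [1] * 2
plain_error = sum(errors) / len(errors)
reweighted_error = weighted_error(errors, weights)
print(plain_error, reweighted_error)
```

The reweighted error is higher than the plain source error because the examples the model gets wrong are the ones that are more common in the target domain.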

== Fine-Tuning vs Feature Extraction ==

{| class="wikitable"
|-
! Strategy !! Description !! When to use
|-
| '''Feature extraction''' || Freeze all pretrained layers; train only a new output head || Very small target dataset; source and target are closely related
|-
| '''Fine-tuning (full)''' || Unfreeze all layers and train end-to-end with a small learning rate || Moderate target dataset; source and target differ meaningfully
|-
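| '''Gradual unfreezing''' || Progressively unfreeze layers from the top (output side) to the bottom (input side) over training || When full fine-tuning is unstable or degrades lower-layer features; balances stability of lower features with adaptation of higher ones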
|}
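
The gradual unfreezing row can be sketched as a schedule that maps an epoch to the set of trainable layers. This is a framework-independent illustration rather than a reference implementation: the layer names and the one-layer-per-stage rule are assumptions, and in practice the returned names would be used to switch gradient updates on or off for the matching parameters.

```python
def trainable_layers(layer_names, epoch, epochs_per_stage=1):
    """Return the layers that are trainable at `epoch` (0-indexed).

    `layer_names` is ordered from input (bottom) to output (top). The top
    layer is trainable from epoch 0, and one more layer below it is
    unfrozen every `epochs_per_stage` epochs until all are trainable.
    """
    n_unfrozen = min(len(layer_names), 1 + epoch // epochs_per_stage)
    return layer_names[len(layer_names) - n_unfrozen:]


# Hypothetical layer names for a small four-layer model.
layers = ["embed", "block1", "block2", "head"]
schedule = [trainable_layers(layers, epoch) for epoch in range(5)]
for epoch, names in enumerate(schedule):
    print(epoch, names)
```

Setting <code>epochs_per_stage</code> to a larger value keeps the lower layers frozen for longer, which moves the behaviour closer to feature extraction.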

A common heuristic is to use a learning rate 10–100× smaller for the pretrained layers than for the newly initialised classification head. The smaller rate limits how far the pretrained weights drift, which reduces the risk of catastrophic forgetting of learned representations.
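
A minimal sketch of this heuristic, assuming plain SGD and a model reduced to two named scalar parameters, is shown below. The list-of-dictionaries layout resembles the per-parameter-group format that some deep learning optimizers accept, but every name here (<code>make_param_groups</code>, <code>sgd_step</code>, the parameter names) is hypothetical and chosen for illustration.

```python
def make_param_groups(backbone_params, head_params, head_lr=1e-3, ratio=10):
    """Build optimizer-style parameter groups.

    Pretrained (backbone) parameters get `head_lr / ratio`; the newly
    initialised head gets `head_lr`. A `ratio` between 10 and 100
    corresponds to the heuristic described above.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1 so the backbone learns more slowly")
    return [
        {"params": list(backbone_params), "lr": head_lr / ratio},
        {"params": list(head_params), "lr": head_lr},
    ]


def sgd_step(values, grads, groups):
    """Return new parameter values after one plain SGD step per group."""
    updated = dict(values)
    for group in groups:
        for name in group["params"]:
            updated[name] = values[name] - group["lr"] * grads[name]
    return updated


# Hypothetical parameter names; both start at 1.0 with the same gradient.
groups = make_param_groups(["conv1.weight"], ["fc.weight"], head_lr=0.1, ratio=10)
values = {"conv1.weight": 1.0, "fc.weight": 1.0}
grads = {"conv1.weight": 1.0, "fc.weight": 1.0}
new_values = sgd_step(values, grads, groups)
print(new_values)
```

With identical gradients, the pretrained parameter moves a tenth as far as the head parameter, which is the intended effect of the smaller rate.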

== Pretrained Models ==

=== Computer Vision ===

ImageNet-pretrained networks serve as standard backbones, including convolutional architectures (ResNet, EfficientNet) and Vision Transformers (ViT). Lower layers tend to learn general features such as edges and textures, while higher layers learn more task-specific patterns (Yosinski et al., 2014). Fine-tuning an ImageNet model on a medical imaging dataset with only a few thousand images often outperforms training from scratch.

=== Natural Language Processing ===

Language model pretraining transformed NLP. Key milestones include:

* '''Word2Vec / GloVe''' — static word embeddings pretrained on large corpora.
* '''ELMo''' — contextualised embeddings from bidirectional LSTMs.
* '''BERT''' (Devlin et al., 2019) — bidirectional Transformer pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
* '''GPT series''' — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.

== When to Use Transfer Learning ==

Transfer learning is most beneficial when:

# The target dataset is small relative to the model's capacity.
# The source and target domains share structural similarities (e.g., both involve natural images or natural language).
# Computational resources for full pretraining are unavailable.
# Rapid prototyping is needed before committing to large-scale data collection.

It may hurt performance ('''negative transfer''') when the source and target domains are fundamentally dissimilar — for instance, transferring from natural images to spectrograms without appropriate adaptation.

== Practical Tips ==

* '''Data augmentation''' complements transfer learning by artificially expanding the effective size of the target dataset.
* '''Learning rate warmup''' helps stabilise early training when fine-tuning large pretrained models.
* '''Early stopping''' on a validation set prevents overfitting during fine-tuning, especially with small datasets.
* '''Layer-wise learning rate decay''' assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
* '''Intermediate task transfer''' — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.
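
Three of the tips above (layer-wise learning rate decay, learning rate warmup, and early stopping) reduce to short schedule functions. The sketch below is one possible formulation under simple assumptions: geometric decay per layer, linear warmup, and patience-based stopping. The decay factor, patience, and validation losses are made-up values for illustration.

```python
def layerwise_lrs(num_layers, top_lr, decay=0.8):
    """Learning rate per layer; index 0 is the earliest layer.

    The top layer gets `top_lr`, and each layer below it is scaled by
    `decay` once more, so earlier layers receive smaller rates.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def warmup_factor(step, warmup_steps):
    """Linear warmup multiplier that reaches 1.0 after `warmup_steps` steps."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)


def early_stopping_epoch(val_losses, patience=2):
    """Epoch (0-indexed) at which training stops, or None if it never does.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


lrs = layerwise_lrs(num_layers=4, top_lr=1e-3, decay=0.5)
first_step_lrs = [lr * warmup_factor(0, warmup_steps=4) for lr in lrs]
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71], patience=2)
print(lrs, first_step_lrs, stop_epoch)
```

The warmup multiplier applies on top of the per-layer rates, so the ordering between layers is preserved throughout the warmup phase.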

== Evaluation ==

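Transfer learning effectiveness is typically measured as the difference in target-task accuracy between a model initialised from pretrained weights and the same architecture trained from scratch: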

:<math>\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}</math>

A positive <math>\Delta_{\mathrm{transfer}}</math> indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.
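
Both quantities can be computed from per-epoch validation accuracy curves, as in the sketch below. The accuracy values are invented for illustration and are not experimental results; the function names are likewise assumptions.

```python
def transfer_gain(acc_transfer, acc_scratch):
    """Delta_transfer: final accuracy with transfer minus accuracy from scratch."""
    return acc_transfer - acc_scratch


def epochs_to_reach(acc_curve, target):
    """First 1-indexed epoch whose accuracy meets `target`, or None."""
    for epoch, acc in enumerate(acc_curve, start=1):
        if acc >= target:
            return epoch
    return None


# Made-up validation accuracies per epoch, for illustration only.
scratch_curve = [0.40, 0.55, 0.65, 0.72, 0.78, 0.80]
transfer_curve = [0.70, 0.82, 0.86, 0.87, 0.88, 0.88]

delta = transfer_gain(transfer_curve[-1], scratch_curve[-1])
scratch_epochs = epochs_to_reach(scratch_curve, target=0.80)
transfer_epochs = epochs_to_reach(transfer_curve, target=0.80)
print(round(delta, 2), scratch_epochs, transfer_epochs)
```

In this toy comparison the transferred model reaches the target accuracy in fewer epochs and also ends with a positive <math>\Delta_{\mathrm{transfer}}</math>.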

== See also ==

* [[Neural Networks]]
* [[Transformer]]
* [[Self-supervised learning]]
* [[Domain adaptation]]
* [[Fine-tuning]]

== References ==

* Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". ''IEEE Transactions on Knowledge and Data Engineering''.
* Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". ''NeurIPS''.
* Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". ''NAACL''.
* Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ''ACL''.
* Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". ''Proceedings of the IEEE''.

[[Category:Machine Learning]]
[[Category:Intermediate]]
