Transfer Learning/en - Revision history

FuzzyBot: Updating to match new version of source page

2026-04-27T23:35:31Z

← Older revision		Revision as of 23:35, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''Transfer learning''' is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.		'''Transfer learning''' is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale {{Term\|pre-training\|pretraining}}, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.

	== Motivation ==		== Motivation ==
Line 13:		Line 13:
	=== Domain and Task ===		=== Domain and Task ===

	Formally, a '''domain''' <math>\mathcal{D} = \{\mathcal{X}, P(X)\}</math> consists of a feature space <math>\mathcal{X}</math> and a marginal distribution <math>P(X)</math>. A '''task''' <math>\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}</math> consists of a label space <math>\mathcal{Y}</math> and a predictive function <math>f</math>. Transfer learning applies when the source and target differ in domain, task, or both.		Formally, a '''domain''' <math>\mathcal{D} = \{\mathcal{X}, P(X)\}</math> consists of a {{Term\|latent space\|feature space}} <math>\mathcal{X}</math> and a marginal distribution <math>P(X)</math>. A '''task''' <math>\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}</math> consists of a label space <math>\mathcal{Y}</math> and a predictive function <math>f</math>. Transfer learning applies when the source and target differ in domain, task, or both.

	=== Domain Adaptation ===		=== Domain Adaptation ===

	When the source and target share the same task but differ in data distribution (<math>P_s(X) \neq P_t(X)</math>), the problem is called '''domain adaptation'''. Techniques include:		When the source and target share the same task but differ in data distribution (<math>P_s(X) \neq P_t(X)</math>), the problem is called '''{{Term\|domain adaptation}}'''. Techniques include:

	* '''Instance reweighting''' — adjusting sample weights so the source distribution approximates the target.		* '''Instance reweighting''' — adjusting sample weights so the source distribution approximates the target.
Line 31:		Line 31:
	\| '''Feature extraction''' \|\| Freeze all pretrained layers; train only a new output head \|\| Very small target dataset; source and target are closely related		\| '''Feature extraction''' \|\| Freeze all pretrained layers; train only a new output head \|\| Very small target dataset; source and target are closely related
	\|-		\|-
	\| '''~~Fine~~-tuning (full)''' \|\| Unfreeze all layers and train end-to-end with a small learning rate \|\| Moderate target dataset; source and target differ meaningfully		\| '''{{Term\|fine-tuning}} (full)''' \|\| Unfreeze all layers and train end-to-end with a small {{Term\|learning rate}} \|\| Moderate target dataset; source and target differ meaningfully
	\|-		\|-
	\| '''Gradual unfreezing''' \|\| Progressively unfreeze layers from top to bottom over training \|\| Balances stability of lower features with adaptation of higher ones		\| '''Gradual unfreezing''' \|\| Progressively unfreeze layers from top to bottom over training \|\| Balances stability of lower features with adaptation of higher ones
	\|}		\|}

	A common heuristic is to use a learning rate 10–100x smaller for pretrained layers than for the new classification head, preventing catastrophic forgetting of learned representations.		A common heuristic is to use a {{Term\|learning rate}} 10–100x smaller for pretrained layers than for the new classification head, preventing catastrophic forgetting of learned representations.

	== Pretrained Models ==		== Pretrained Models ==
Line 42:		Line 42:
	=== Computer Vision ===		=== Computer Vision ===

	ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. ~~Fine~~-tuning an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.		ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. {{Term\|fine-tuning}} an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.

	=== Natural Language Processing ===		=== Natural Language Processing ===

	Language model pretraining transformed NLP. Key milestones include:		Language model {{Term\|pre-training\|pretraining}} transformed NLP. Key milestones include:

	* '''Word2Vec / GloVe''' — static word embeddings pretrained on large corpora.		* '''Word2Vec / GloVe''' — static word {{Term\|embedding\|embeddings}} pretrained on large corpora.
	* '''ELMo''' — contextualised embeddings from bidirectional LSTMs.		* '''ELMo''' — contextualised {{Term\|embedding\|embeddings}} from bidirectional {{Term\|long short-term memory\|LSTMs}}.
	* '''BERT''' (Devlin et al., 2019) — bidirectional ~~Transformer~~ pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.		* '''BERT''' (Devlin et al., 2019) — bidirectional {{Term\|transformer}} pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
	* '''GPT series''' — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.		* '''GPT series''' — autoregressive {{Term\|transformer\|Transformers}} demonstrating that scale and {{Term\|pre-training\|pretraining}} enable few-shot and zero-shot transfer.

	== When to Use Transfer Learning ==		== When to Use Transfer Learning ==
Line 59:		Line 59:
	# The target dataset is small relative to the model's capacity.		# The target dataset is small relative to the model's capacity.
	# The source and target domains share structural similarities (e.g., both involve natural images or natural language).		# The source and target domains share structural similarities (e.g., both involve natural images or natural language).
	# Computational resources for full pretraining are unavailable.		# Computational resources for full {{Term\|pre-training\|pretraining}} are unavailable.
	# Rapid prototyping is needed before committing to large-scale data collection.		# Rapid prototyping is needed before committing to large-scale data collection.

Line 67:		Line 67:

	* '''Data augmentation''' complements transfer learning by artificially expanding the effective size of the target dataset.		* '''Data augmentation''' complements transfer learning by artificially expanding the effective size of the target dataset.
	* '''~~Learning~~ rate warmup''' helps stabilise early training when fine-tuning large pretrained models.		* '''{{Term\|learning rate}} warmup''' helps stabilise early training when {{Term\|fine-tuning}} large pretrained models.
	* '''Early stopping''' on a validation set prevents overfitting during fine-tuning, especially with small datasets.		* '''Early stopping''' on a validation set prevents {{Term\|overfitting}} during {{Term\|fine-tuning}}, especially with small datasets.
	* '''Layer-wise learning rate decay''' assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.		* '''Layer-wise {{Term\|learning rate}} decay''' assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
	* '''Intermediate task transfer''' — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.		* '''Intermediate task transfer''' — {{Term\|fine-tuning}} on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.

	== Evaluation ==		== Evaluation ==
Line 78:		Line 78:
	:<math>\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}</math>		:<math>\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}</math>

	A positive <math>\Delta_{\mathrm{transfer}}</math> indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.		A positive <math>\Delta_{\mathrm{transfer}}</math> indicates successful knowledge transfer. Practitioners also track {{Term\|convergence}} speed, as transferred models often reach target performance in a fraction of the {{Term\|epoch\|epochs}}.

	== See also ==		== See also ==
Line 92:		Line 92:
	* Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". ''IEEE Transactions on Knowledge and Data Engineering''.		* Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". ''IEEE Transactions on Knowledge and Data Engineering''.
	* Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". ''NeurIPS''.		* Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". ''NeurIPS''.
	* Devlin, J. et al. (2019). "BERT: ~~Pre~~-training of Deep Bidirectional Transformers for Language Understanding". ''NAACL''.		* Devlin, J. et al. (2019). "BERT: {{Term\|pre-training}} of Deep Bidirectional {{Term\|transformer\|Transformers}} for Language Understanding". ''NAACL''.
	* Howard, J. and Ruder, S. (2018). "Universal Language Model ~~Fine~~-tuning for Text Classification". ''ACL''.		* Howard, J. and Ruder, S. (2018). "Universal Language Model {{Term\|fine-tuning}} for Text Classification". ''ACL''.
	* Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". ''Proceedings of the IEEE''.		* Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". ''Proceedings of the IEEE''.

	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Intermediate]]		[[Category:Intermediate]]

FuzzyBot: Updating to match new version of source page

2026-04-27T22:30:19Z

← Older revision		Revision as of 22:30, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''Transfer learning''' is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale ~~{{Term\|pre-training\|~~pretraining}}, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.		'''Transfer learning''' is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.

	== Motivation ==		== Motivation ==
Line 13:		Line 13:
	=== Domain and Task ===		=== Domain and Task ===

	Formally, a '''domain''' <math>\mathcal{D} = \{\mathcal{X}, P(X)\}</math> consists of a ~~{{Term\|latent space\|~~feature space}} <math>\mathcal{X}</math> and a marginal distribution <math>P(X)</math>. A '''task''' <math>\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}</math> consists of a label space <math>\mathcal{Y}</math> and a predictive function <math>f</math>. Transfer learning applies when the source and target differ in domain, task, or both.		Formally, a '''domain''' <math>\mathcal{D} = \{\mathcal{X}, P(X)\}</math> consists of a feature space <math>\mathcal{X}</math> and a marginal distribution <math>P(X)</math>. A '''task''' <math>\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}</math> consists of a label space <math>\mathcal{Y}</math> and a predictive function <math>f</math>. Transfer learning applies when the source and target differ in domain, task, or both.

	=== Domain Adaptation ===		=== Domain Adaptation ===

	When the source and target share the same task but differ in data distribution (<math>P_s(X) \neq P_t(X)</math>), the problem is called '''~~{{Term\|~~domain adaptation}}'''. Techniques include:		When the source and target share the same task but differ in data distribution (<math>P_s(X) \neq P_t(X)</math>), the problem is called '''domain adaptation'''. Techniques include:

	* '''Instance reweighting''' — adjusting sample weights so the source distribution approximates the target.		* '''Instance reweighting''' — adjusting sample weights so the source distribution approximates the target.
Line 31:		Line 31:
	\| '''Feature extraction''' \|\| Freeze all pretrained layers; train only a new output head \|\| Very small target dataset; source and target are closely related		\| '''Feature extraction''' \|\| Freeze all pretrained layers; train only a new output head \|\| Very small target dataset; source and target are closely related
	\|-		\|-
	\| '''~~{{Term\|fine~~-tuning}} (full)''' \|\| Unfreeze all layers and train end-to-end with a small ~~{{Term\|~~learning rate}} \|\| Moderate target dataset; source and target differ meaningfully		\| '''Fine-tuning (full)''' \|\| Unfreeze all layers and train end-to-end with a small learning rate \|\| Moderate target dataset; source and target differ meaningfully
	\|-		\|-
	\| '''Gradual unfreezing''' \|\| Progressively unfreeze layers from top to bottom over training \|\| Balances stability of lower features with adaptation of higher ones		\| '''Gradual unfreezing''' \|\| Progressively unfreeze layers from top to bottom over training \|\| Balances stability of lower features with adaptation of higher ones
	\|}		\|}

	A common heuristic is to use a ~~{{Term\|~~learning rate}} 10–100x smaller for pretrained layers than for the new classification head, preventing catastrophic forgetting of learned representations.		A common heuristic is to use a learning rate 10–100x smaller for pretrained layers than for the new classification head, preventing catastrophic forgetting of learned representations.

	== Pretrained Models ==		== Pretrained Models ==
Line 42:		Line 42:
	=== Computer Vision ===		=== Computer Vision ===

	ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. ~~{{Term\|fine~~-tuning}} an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.		ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. Fine-tuning an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.

	=== Natural Language Processing ===		=== Natural Language Processing ===

	Language model ~~{{Term\|pre-training\|~~pretraining}} transformed NLP. Key milestones include:		Language model pretraining transformed NLP. Key milestones include:

	* '''Word2Vec / GloVe''' — static word ~~{{Term\|embedding\|~~embeddings}} pretrained on large corpora.		* '''Word2Vec / GloVe''' — static word embeddings pretrained on large corpora.
	* '''ELMo''' — contextualised ~~{{Term\|embedding\|~~embeddings}} from bidirectional ~~{{Term\|long short-term memory\|~~LSTMs}}.		* '''ELMo''' — contextualised embeddings from bidirectional LSTMs.
	* '''BERT''' (Devlin et al., 2019) — bidirectional ~~{{Term\|transformer}}~~ pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.		* '''BERT''' (Devlin et al., 2019) — bidirectional Transformer pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
	* '''GPT series''' — autoregressive ~~{{Term\|transformer\|~~Transformers}} demonstrating that scale and ~~{{Term\|pre-training\|~~pretraining}} enable few-shot and zero-shot transfer.		* '''GPT series''' — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.

	== When to Use Transfer Learning ==		== When to Use Transfer Learning ==
Line 59:		Line 59:
	# The target dataset is small relative to the model's capacity.		# The target dataset is small relative to the model's capacity.
	# The source and target domains share structural similarities (e.g., both involve natural images or natural language).		# The source and target domains share structural similarities (e.g., both involve natural images or natural language).
	# Computational resources for full ~~{{Term\|pre-training\|~~pretraining}} are unavailable.		# Computational resources for full pretraining are unavailable.
	# Rapid prototyping is needed before committing to large-scale data collection.		# Rapid prototyping is needed before committing to large-scale data collection.

Line 67:		Line 67:

	* '''Data augmentation''' complements transfer learning by artificially expanding the effective size of the target dataset.		* '''Data augmentation''' complements transfer learning by artificially expanding the effective size of the target dataset.
	* '''~~{{Term\|learning~~ rate}} warmup''' helps stabilise early training when ~~{{Term\|~~fine-tuning}} large pretrained models.		* '''Learning rate warmup''' helps stabilise early training when fine-tuning large pretrained models.
	* '''Early stopping''' on a validation set prevents ~~{{Term\|~~overfitting}} during ~~{{Term\|~~fine-tuning}}, especially with small datasets.		* '''Early stopping''' on a validation set prevents overfitting during fine-tuning, especially with small datasets.
	* '''Layer-wise ~~{{Term\|~~learning rate}} decay''' assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.		* '''Layer-wise learning rate decay''' assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
	* '''Intermediate task transfer''' — ~~{{Term\|~~fine-tuning}} on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.		* '''Intermediate task transfer''' — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.

	== Evaluation ==		== Evaluation ==
Line 78:		Line 78:
	:<math>\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}</math>		:<math>\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}</math>

	A positive <math>\Delta_{\mathrm{transfer}}</math> indicates successful knowledge transfer. Practitioners also track ~~{{Term\|~~convergence}} speed, as transferred models often reach target performance in a fraction of the ~~{{Term\|epoch\|~~epochs}}.		A positive <math>\Delta_{\mathrm{transfer}}</math> indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.

	== See also ==		== See also ==
Line 92:		Line 92:
	* Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". ''IEEE Transactions on Knowledge and Data Engineering''.		* Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". ''IEEE Transactions on Knowledge and Data Engineering''.
	* Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". ''NeurIPS''.		* Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". ''NeurIPS''.
	* Devlin, J. et al. (2019). "BERT: ~~{{Term\|pre~~-training}} of Deep Bidirectional ~~{{Term\|transformer\|~~Transformers}} for Language Understanding". ''NAACL''.		* Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". ''NAACL''.
	* Howard, J. and Ruder, S. (2018). "Universal Language Model ~~{{Term\|fine~~-tuning}} for Text Classification". ''ACL''.		* Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ''ACL''.
	* Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". ''Proceedings of the IEEE''.		* Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". ''Proceedings of the IEEE''.

	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Intermediate]]		[[Category:Intermediate]]

FuzzyBot: Updating to match new version of source page

2026-04-27T19:42:37Z

← Older revision		Revision as of 19:42, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''Transfer learning''' is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.		'''Transfer learning''' is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale {{Term\|pre-training\|pretraining}}, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.

	== Motivation ==		== Motivation ==
Line 13:		Line 13:
	=== Domain and Task ===		=== Domain and Task ===

	Formally, a '''domain''' <math>\mathcal{D} = \{\mathcal{X}, P(X)\}</math> consists of a feature space <math>\mathcal{X}</math> and a marginal distribution <math>P(X)</math>. A '''task''' <math>\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}</math> consists of a label space <math>\mathcal{Y}</math> and a predictive function <math>f</math>. Transfer learning applies when the source and target differ in domain, task, or both.		Formally, a '''domain''' <math>\mathcal{D} = \{\mathcal{X}, P(X)\}</math> consists of a {{Term\|latent space\|feature space}} <math>\mathcal{X}</math> and a marginal distribution <math>P(X)</math>. A '''task''' <math>\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}</math> consists of a label space <math>\mathcal{Y}</math> and a predictive function <math>f</math>. Transfer learning applies when the source and target differ in domain, task, or both.

	=== Domain Adaptation ===		=== Domain Adaptation ===

	When the source and target share the same task but differ in data distribution (<math>P_s(X) \neq P_t(X)</math>), the problem is called '''domain adaptation'''. Techniques include:		When the source and target share the same task but differ in data distribution (<math>P_s(X) \neq P_t(X)</math>), the problem is called '''{{Term\|domain adaptation}}'''. Techniques include:

	* '''Instance reweighting''' — adjusting sample weights so the source distribution approximates the target.		* '''Instance reweighting''' — adjusting sample weights so the source distribution approximates the target.
Line 31:		Line 31:
	\| '''Feature extraction''' \|\| Freeze all pretrained layers; train only a new output head \|\| Very small target dataset; source and target are closely related		\| '''Feature extraction''' \|\| Freeze all pretrained layers; train only a new output head \|\| Very small target dataset; source and target are closely related
	\|-		\|-
	\| '''~~Fine~~-tuning (full)''' \|\| Unfreeze all layers and train end-to-end with a small learning rate \|\| Moderate target dataset; source and target differ meaningfully		\| '''{{Term\|fine-tuning}} (full)''' \|\| Unfreeze all layers and train end-to-end with a small {{Term\|learning rate}} \|\| Moderate target dataset; source and target differ meaningfully
	\|-		\|-
	\| '''Gradual unfreezing''' \|\| Progressively unfreeze layers from top to bottom over training \|\| Balances stability of lower features with adaptation of higher ones		\| '''Gradual unfreezing''' \|\| Progressively unfreeze layers from top to bottom over training \|\| Balances stability of lower features with adaptation of higher ones
	\|}		\|}

	A common heuristic is to use a learning rate 10–100x smaller for pretrained layers than for the new classification head, preventing catastrophic forgetting of learned representations.		A common heuristic is to use a {{Term\|learning rate}} 10–100x smaller for pretrained layers than for the new classification head, preventing catastrophic forgetting of learned representations.

	== Pretrained Models ==		== Pretrained Models ==
Line 42:		Line 42:
	=== Computer Vision ===		=== Computer Vision ===

	ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. ~~Fine~~-tuning an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.		ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. {{Term\|fine-tuning}} an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.

	=== Natural Language Processing ===		=== Natural Language Processing ===

	Language model pretraining transformed NLP. Key milestones include:		Language model {{Term\|pre-training\|pretraining}} transformed NLP. Key milestones include:

	* '''Word2Vec / GloVe''' — static word embeddings pretrained on large corpora.		* '''Word2Vec / GloVe''' — static word {{Term\|embedding\|embeddings}} pretrained on large corpora.
	* '''ELMo''' — contextualised embeddings from bidirectional LSTMs.		* '''ELMo''' — contextualised {{Term\|embedding\|embeddings}} from bidirectional {{Term\|long short-term memory\|LSTMs}}.
	* '''BERT''' (Devlin et al., 2019) — bidirectional ~~Transformer~~ pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.		* '''BERT''' (Devlin et al., 2019) — bidirectional {{Term\|transformer}} pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
	* '''GPT series''' — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.		* '''GPT series''' — autoregressive {{Term\|transformer\|Transformers}} demonstrating that scale and {{Term\|pre-training\|pretraining}} enable few-shot and zero-shot transfer.

	== When to Use Transfer Learning ==		== When to Use Transfer Learning ==
Line 59:		Line 59:
	# The target dataset is small relative to the model's capacity.		# The target dataset is small relative to the model's capacity.
	# The source and target domains share structural similarities (e.g., both involve natural images or natural language).		# The source and target domains share structural similarities (e.g., both involve natural images or natural language).
	# Computational resources for full pretraining are unavailable.		# Computational resources for full {{Term\|pre-training\|pretraining}} are unavailable.
	# Rapid prototyping is needed before committing to large-scale data collection.		# Rapid prototyping is needed before committing to large-scale data collection.

Line 67:		Line 67:

	* '''Data augmentation''' complements transfer learning by artificially expanding the effective size of the target dataset.		* '''Data augmentation''' complements transfer learning by artificially expanding the effective size of the target dataset.
	* '''~~Learning~~ rate warmup''' helps stabilise early training when fine-tuning large pretrained models.		* '''{{Term\|learning rate}} warmup''' helps stabilise early training when {{Term\|fine-tuning}} large pretrained models.
	* '''Early stopping''' on a validation set prevents overfitting during fine-tuning, especially with small datasets.		* '''Early stopping''' on a validation set prevents {{Term\|overfitting}} during {{Term\|fine-tuning}}, especially with small datasets.
	* '''Layer-wise learning rate decay''' assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.		* '''Layer-wise {{Term\|learning rate}} decay''' assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
	* '''Intermediate task transfer''' — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.		* '''Intermediate task transfer''' — {{Term\|fine-tuning}} on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.

	== Evaluation ==		== Evaluation ==
Line 78:		Line 78:
	:<math>\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}</math>		:<math>\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}</math>

	A positive <math>\Delta_{\mathrm{transfer}}</math> indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.		A positive <math>\Delta_{\mathrm{transfer}}</math> indicates successful knowledge transfer. Practitioners also track {{Term\|convergence}} speed, as transferred models often reach target performance in a fraction of the {{Term\|epoch\|epochs}}.

	== See also ==		== See also ==
Line 92:		Line 92:
	* Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". ''IEEE Transactions on Knowledge and Data Engineering''.		* Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". ''IEEE Transactions on Knowledge and Data Engineering''.
	* Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". ''NeurIPS''.		* Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". ''NeurIPS''.
	* Devlin, J. et al. (2019). "BERT: ~~Pre~~-training of Deep Bidirectional Transformers for Language Understanding". ''NAACL''.		* Devlin, J. et al. (2019). "BERT: {{Term\|pre-training}} of Deep Bidirectional {{Term\|transformer\|Transformers}} for Language Understanding". ''NAACL''.
	* Howard, J. and Ruder, S. (2018). "Universal Language Model ~~Fine~~-tuning for Text Classification". ''ACL''.		* Howard, J. and Ruder, S. (2018). "Universal Language Model {{Term\|fine-tuning}} for Text Classification". ''ACL''.
	* Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". ''Proceedings of the IEEE''.		* Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". ''Proceedings of the IEEE''.

	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Intermediate]]		[[Category:Intermediate]]

FuzzyBot: Updating to match new version of source page

2026-04-27T02:37:43Z

← Older revision		Revision as of 02:37, 27 April 2026
Line 1:		Line 1:
	<languages />		<languages />
	~~{{LanguageBar \| page = Transfer Learning}}~~
	{{ArticleInfobox \| topic_area = Machine Learning \| difficulty = Intermediate \| prerequisites = [[Neural Networks]]}}		{{ArticleInfobox \| topic_area = Machine Learning \| difficulty = Intermediate \| prerequisites = [[Neural Networks]]}}
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

FuzzyBot: Updating to match new version of source page

2026-04-27T00:31:20Z

New page

<languages />
{{LanguageBar | page = Transfer Learning}}
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]]}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Transfer learning''' is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning can substantially reduce the amount of labelled data, compute, and training time required for downstream applications.

== Motivation ==

Training deep neural networks from scratch typically requires large datasets and significant computational resources. In many practical domains, such as medical imaging, legal text analysis, and low-resource languages, labelled data is scarce. Transfer learning addresses this mismatch: a model pretrained on a data-rich source task captures general features (edges, textures, syntactic patterns) that often transfer well to a data-scarce target task.

== Key Concepts ==

=== Domain and Task ===

Formally, a '''domain''' <math>\mathcal{D} = \{\mathcal{X}, P(X)\}</math> consists of a feature space <math>\mathcal{X}</math> and a marginal distribution <math>P(X)</math>. A '''task''' <math>\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}</math> consists of a label space <math>\mathcal{Y}</math> and a predictive function <math>f</math>. Transfer learning applies when the source and target differ in domain, task, or both.

=== Domain Adaptation ===

When the source and target share the same task but differ in data distribution (<math>P_s(X) \neq P_t(X)</math>), the problem is called '''domain adaptation'''. Techniques include:

* '''Instance reweighting''' — adjusting sample weights so the source distribution approximates the target.
* '''Feature alignment''' — learning domain-invariant representations (e.g., via adversarial training or maximum mean discrepancy).
* '''Self-training''' — using model predictions on unlabelled target data as pseudo-labels.
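
As a rough illustration of the feature-alignment idea, the squared maximum mean discrepancy (MMD) between a source sample and a target sample can be estimated with a kernel. The sketch below assumes a Gaussian (RBF) kernel with a hand-picked bandwidth <code>sigma</code> and uses the simple biased estimator; the function names and the toy feature vectors are illustrative assumptions, not part of any particular library.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Toy 2-D features (made up for illustration).
source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
near_target = [[0.1, 0.0], [0.0, 0.1], [-0.2, 0.1]]
far_target = [[3.0, 3.1], [2.9, 3.0], [3.2, 2.8]]

print("MMD^2 to nearby target: ", mmd_squared(source, near_target))
print("MMD^2 to distant target:", mmd_squared(source, far_target))
```

A feature-alignment method would add a term like this to the training loss so that the learned representations of the two domains are pushed closer together; the choice of kernel and bandwidth is left open here.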

== Fine-Tuning vs Feature Extraction ==

{| class="wikitable"
|-
! Strategy !! Description !! When to use
|-
| '''Feature extraction''' || Freeze all pretrained layers; train only a new output head || Very small target dataset; source and target are closely related
|-
| '''Fine-tuning (full)''' || Unfreeze all layers and train end-to-end with a small learning rate || Moderate target dataset; source and target differ meaningfully
|-
| '''Gradual unfreezing''' || Progressively unfreeze layers over training, starting with those nearest the output and moving toward the input || Full fine-tuning is unstable or overfits; lower-layer features stay stable while higher layers adapt
|}

A common heuristic is to use a learning rate 10–100 times smaller for pretrained layers than for the new classification head, which reduces the risk of catastrophic forgetting of the learned representations.
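
The three strategies in the table, together with the learning-rate heuristic, can be sketched as a small planning function. This is a framework-independent illustration under stated assumptions: the layer names, the function <code>plan_training</code>, the default ratio of 10, and the rule of unfreezing one layer per epoch starting nearest the output are all hypothetical choices made for this example.

```python
def plan_training(layers, strategy, head_lr=1e-3, lr_ratio=10.0, epoch=0):
    """Map each layer name to a learning rate, or None if the layer is frozen.

    `layers` lists the pretrained layers from input to output. A new
    output head named "head" is always trained at `head_lr`.
    """
    backbone_lr = head_lr / lr_ratio  # smaller rate for pretrained layers
    if strategy == "feature_extraction":
        trainable = set()
    elif strategy == "full":
        trainable = set(layers)
    elif strategy == "gradual":
        # Assumed schedule: one more layer per epoch, nearest the output first.
        n_unfrozen = min(epoch, len(layers))
        trainable = set(layers[len(layers) - n_unfrozen:])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    plan = {name: (backbone_lr if name in trainable else None) for name in layers}
    plan["head"] = head_lr
    return plan

layers = ["block1", "block2", "block3"]
for epoch in range(4):
    print(epoch, plan_training(layers, "gradual", epoch=epoch))
```

In a real framework the same plan would typically be applied by disabling gradients for frozen layers and passing per-layer learning rates to the optimiser.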

== Pretrained Models ==

=== Computer Vision ===

ImageNet-pretrained networks, including convolutional architectures (ResNet, EfficientNet) and the Vision Transformer (ViT), serve as standard backbones. Lower layers learn broadly reusable features such as edges and textures, while higher layers learn more task-specific patterns. Fine-tuning an ImageNet model on a medical imaging dataset with only a few thousand images often outperforms training from scratch.

=== Natural Language Processing ===

Language model pretraining transformed NLP. Key milestones include:

* '''Word2Vec / GloVe''' — static word embeddings pretrained on large corpora.
* '''ELMo''' — contextualised embeddings from bidirectional LSTMs.
* '''BERT''' (Devlin et al., 2019) — bidirectional Transformer pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
* '''GPT series''' — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.

== When to Use Transfer Learning ==

Transfer learning is most beneficial when:

# The target dataset is small relative to the model's capacity.
# The source and target domains share structural similarities (e.g., both involve natural images or natural language).
# Computational resources for full pretraining are unavailable.
# Rapid prototyping is needed before committing to large-scale data collection.

Transfer can instead hurt performance ('''negative transfer''') when the source and target domains are fundamentally dissimilar, for instance when transferring from natural images to spectrograms without appropriate adaptation.

== Practical Tips ==

* '''Data augmentation''' complements transfer learning by artificially expanding the effective size of the target dataset.
* '''Learning rate warmup''' helps stabilise early training when fine-tuning large pretrained models.
* '''Early stopping''' on a validation set prevents overfitting during fine-tuning, especially with small datasets.
* '''Layer-wise learning rate decay''' assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
* '''Intermediate task transfer''' — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.
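
Two of the tips above, layer-wise learning rate decay and learning rate warmup, reduce to simple arithmetic. The following sketch assumes a geometric decay factor of 0.8 per layer and a linear warmup over 100 steps; both values and the function names are illustrative assumptions rather than recommended settings.

```python
def layerwise_learning_rates(num_layers, top_lr=1e-3, decay=0.8):
    """Learning rate per layer; index 0 is the earliest layer.

    The layer closest to the output gets `top_lr`, and each earlier
    layer gets the rate of the next one multiplied by `decay`.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

def warmup_factor(step, warmup_steps):
    """Linear warmup multiplier that rises to 1.0 over `warmup_steps`."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)

rates = layerwise_learning_rates(num_layers=4)
for step in (0, 49, 99, 500):
    scale = warmup_factor(step, warmup_steps=100)
    print(step, [lr * scale for lr in rates])
```

Each layer's effective rate is its decayed base rate multiplied by the warmup factor, so early layers stay close to their pretrained values throughout, and all layers move cautiously in the first steps.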

== Evaluation ==

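Transfer learning effectiveness is typically measured by comparing the accuracy of the transferred model with that of the same architecture trained from scratch on the target task: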

:<math>\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}</math>

A positive <math>\Delta_{\mathrm{transfer}}</math> indicates successful knowledge transfer, while a negative value signals negative transfer. Practitioners also track convergence speed, as transferred models often reach a given target performance in a fraction of the epochs needed when training from scratch.
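
Both quantities are straightforward to compute from per-epoch validation accuracies. The sketch below uses made-up accuracy curves purely for illustration; the helper names <code>transfer_gain</code> and <code>epochs_to_target</code> are hypothetical.

```python
def transfer_gain(acc_transfer, acc_scratch):
    """Difference in final accuracy between transferred and scratch models."""
    return acc_transfer - acc_scratch

def epochs_to_target(val_accuracies, target):
    """1-based epoch at which accuracy first reaches `target`, else None."""
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc >= target:
            return epoch
    return None

# Made-up validation accuracies per epoch (not real results).
scratch_curve = [0.40, 0.55, 0.63, 0.70, 0.74, 0.78, 0.80, 0.81]
transfer_curve = [0.72, 0.81, 0.85, 0.87, 0.88, 0.88, 0.89, 0.89]

delta = transfer_gain(transfer_curve[-1], scratch_curve[-1])
print(f"delta_transfer = {delta:+.2f}")
print("epochs to reach 0.80 (scratch): ", epochs_to_target(scratch_curve, 0.80))
print("epochs to reach 0.80 (transfer):", epochs_to_target(transfer_curve, 0.80))
```

For a fair comparison, both curves should come from the same architecture, target data split, and evaluation metric.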

== See also ==

* [[Neural Networks]]
* [[Transformer]]
* [[Self-supervised learning]]
* [[Domain adaptation]]
* [[Fine-tuning]]

== References ==

* Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". ''IEEE Transactions on Knowledge and Data Engineering''.
* Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". ''NeurIPS''.
* Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". ''NAACL''.
* Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ''ACL''.
* Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". ''Proceedings of the IEEE''.

[[Category:Machine Learning]]
[[Category:Intermediate]]