    {{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}
    Latest revision as of 23:35, 27 April 2026

    Topic area Machine Learning
    Difficulty Intermediate
    Prerequisites Neural Networks

    Transfer learning is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.

    Motivation

    Training deep neural networks from scratch typically requires large datasets and significant computational resources. In many practical domains — medical imaging, legal text analysis, low-resource languages — labelled data is scarce. Transfer learning addresses this mismatch: a model pretrained on a data-rich source task captures general features (edges, textures, syntactic patterns) that transfer well to a data-scarce target task.

    Key Concepts

    Domain and Task

    Formally, a domain $ \mathcal{D} = \{\mathcal{X}, P(X)\} $ consists of a feature space $ \mathcal{X} $ and a marginal distribution $ P(X) $. A task $ \mathcal{T} = \{\mathcal{Y}, f(\cdot)\} $ consists of a label space $ \mathcal{Y} $ and a predictive function $ f $. Transfer learning applies when the source and target differ in domain, task, or both.

    Domain Adaptation

    When the source and target share the same task but differ in data distribution ($ P_s(X) \neq P_t(X) $), the problem is called domain adaptation. Techniques include:

    • Instance reweighting — adjusting sample weights so the source distribution approximates the target.
    • Feature alignment — learning domain-invariant representations (e.g., via adversarial training or maximum mean discrepancy).
    • Self-training — using model predictions on unlabelled target data as pseudo-labels.
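    Instance reweighting, the first technique above, can be illustrated with a toy density-ratio estimate over discrete bins (the function name `importance_weights` and the histogram estimator are illustrative; practical systems estimate the ratio with a domain classifier or kernel methods):

    ```python
    from collections import Counter

    def importance_weights(source_xs, target_xs):
        """Weight each source sample by p_target(x) / p_source(x),
        estimated here with simple empirical histograms over discrete bins."""
        ps, pt = Counter(source_xs), Counter(target_xs)
        n_s, n_t = len(source_xs), len(target_xs)
        # Bins over-represented in the source get weights below 1;
        # bins the target favours get weights above 1.
        return [(pt[x] / n_t) / (ps[x] / n_s) for x in source_xs]

    # Toy example: the source over-represents bin 'a' relative to the target.
    source = ['a', 'a', 'a', 'b']
    target = ['a', 'b', 'b', 'b']
    weights = importance_weights(source, target)
    # 'a' samples are down-weighted (1/3); the 'b' sample is up-weighted (3.0).
    ```

    Training on the source set with these weights makes its effective distribution mimic the target's.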

    Fine-Tuning vs Feature Extraction

    Strategy           | Description                                                         | When to use
    Feature extraction | Freeze all pretrained layers; train only a new output head          | Very small target dataset; source and target are closely related
    Fine-tuning (full) | Unfreeze all layers and train end-to-end with a small learning rate | Moderate target dataset; source and target differ meaningfully
    Gradual unfreezing | Progressively unfreeze layers from top to bottom over training      | Balances stability of lower features with adaptation of higher ones
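    The gradual-unfreezing row above can be sketched as a simple schedule (a toy helper under assumed conventions — layer 0 is the input side; a real training loop would toggle the corresponding parameter groups' trainability each epoch):

    ```python
    def unfrozen_layers(epoch, num_layers, layers_per_epoch=1):
        """Indices of layers trained at a given epoch, unfreezing from the
        top (output-side) layer toward the bottom (input-side) layer."""
        k = min(num_layers, (epoch + 1) * layers_per_epoch)
        return list(range(num_layers - k, num_layers))

    # With 4 layers: epoch 0 trains only the top layer, epoch 1 the top two,
    # and from epoch 3 onward the whole network is trainable.
    ```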

    A common heuristic is to use a learning rate 10–100x smaller for pretrained layers than for the new classification head, preventing catastrophic forgetting of learned representations.
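    Under that heuristic, one update step with per-group learning rates might look like the following pure-Python sketch (in practice this is expressed through the optimizer's parameter groups):

    ```python
    def sgd_step(params, grads, lrs):
        """One SGD step where each parameter carries its own learning rate,
        mimicking per-group rates in a real optimizer."""
        return [p - lr * g for p, lr, g in zip(params, lrs, grads)]

    # Pretrained backbone weight uses a 100x smaller rate than the new head.
    head_lr = 0.1
    lrs = [head_lr / 100, head_lr]      # [backbone, head]
    params = [1.0, 1.0]
    grads = [1.0, 1.0]
    params = sgd_step(params, grads, lrs)
    # The head moves 100x further per step, so the pretrained representation
    # in the backbone drifts slowly and is not catastrophically overwritten.
    ```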

    Pretrained Models

    Computer Vision

    ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. Fine-tuning an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.

    Natural Language Processing

    Language model pretraining transformed NLP. Key milestones include:

    • Word2Vec / GloVe — static word embeddings pretrained on large corpora.
    • ELMo — contextualised embeddings from bidirectional LSTMs.
    • BERT (Devlin et al., 2019) — bidirectional Transformer pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
    • GPT series — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.

    When to Use Transfer Learning

    Transfer learning is most beneficial when:

    1. The target dataset is small relative to the model's capacity.
    2. The source and target domains share structural similarities (e.g., both involve natural images or natural language).
    3. Computational resources for full pretraining are unavailable.
    4. Rapid prototyping is needed before committing to large-scale data collection.

    It may hurt performance (negative transfer) when the source and target domains are fundamentally dissimilar — for instance, transferring from natural images to spectrograms without appropriate adaptation.

    Practical Tips

    • Data augmentation complements transfer learning by artificially expanding the effective size of the target dataset.
    • Learning rate warmup helps stabilise early training when fine-tuning large pretrained models.
    • Early stopping on a validation set prevents overfitting during fine-tuning, especially with small datasets.
    • Layer-wise learning rate decay assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
    • Intermediate task transfer — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.
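    Layer-wise learning-rate decay from the list above reduces to a one-line geometric schedule (the helper name `layerwise_lrs` is illustrative):

    ```python
    def layerwise_lrs(base_lr, num_layers, decay=0.9):
        """The top layer gets base_lr; each earlier layer's rate is multiplied
        by `decay`, so general low-level features change more slowly than
        task-specific high-level ones."""
        return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

    # Four layers, base LR 1e-3: rates grow geometrically toward the top layer.
    lrs = layerwise_lrs(base_lr=1e-3, num_layers=4)
    ```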

    Evaluation

    Transfer learning effectiveness is typically measured by comparing:

    $ \Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}} $

    A positive $ \Delta_{\mathrm{transfer}} $ indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.
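    The metric above is trivial to compute once both accuracies are in hand (a minimal sketch; accuracies would come from evaluating the two models on the same held-out target set):

    ```python
    def transfer_delta(acc_transfer, acc_scratch):
        """Improvement from transfer learning; positive means the pretrained
        initialisation beat training the same architecture from scratch."""
        return acc_transfer - acc_scratch

    # E.g. 91% with a pretrained backbone vs 84% from scratch: delta = +0.07.
    delta = transfer_delta(0.91, 0.84)
    ```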

    See also

    References

    • Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". IEEE Transactions on Knowledge and Data Engineering.
    • Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". NeurIPS.
    • Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
    • Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ACL.
    • Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". Proceedings of the IEEE.