Transfer Learning

    From Marovi AI

    Latest revision as of 07:09, 24 April 2026

    Topic area Machine Learning
    Difficulty Intermediate
    Prerequisites Neural Networks

    Transfer learning is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.

    Motivation

    Training deep neural networks from scratch typically requires large datasets and significant computational resources. In many practical domains — medical imaging, legal text analysis, low-resource languages — labelled data is scarce. Transfer learning addresses this mismatch: a model pretrained on a data-rich source task captures general features (edges, textures, syntactic patterns) that transfer well to a data-scarce target task.

    Key Concepts

    Domain and Task

    Formally, a domain $ \mathcal{D} = \{\mathcal{X}, P(X)\} $ consists of a feature space $ \mathcal{X} $ and a marginal distribution $ P(X) $. A task $ \mathcal{T} = \{\mathcal{Y}, f(\cdot)\} $ consists of a label space $ \mathcal{Y} $ and a predictive function $ f $. Transfer learning applies when the source and target differ in domain, task, or both.

    Domain Adaptation

    When the source and target share the same task but differ in data distribution ($ P_s(X) \neq P_t(X) $), the problem is called domain adaptation. Techniques include:

    • Instance reweighting — adjusting sample weights so the source distribution approximates the target.
    • Feature alignment — learning domain-invariant representations (e.g., via adversarial training or maximum mean discrepancy).
    • Self-training — using model predictions on unlabelled target data as pseudo-labels.
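The instance reweighting idea above can be sketched in a few lines: estimate the source and target marginals with simple histograms and weight each source sample by the density ratio $ P_t(x)/P_s(x) $. The 1-D samples and bin settings below are illustrative assumptions, not part of any standard API; real implementations typically use learned density-ratio estimators rather than histograms.

```python
from collections import Counter

def histogram(xs, bins, lo, hi):
    """Normalised histogram of xs over equal-width bins on [lo, hi]."""
    width = (hi - lo) / bins
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in xs)
    return [counts[b] / len(xs) for b in range(bins)]

def importance_weights(source_x, target_x, bins=4, lo=0.0, hi=1.0):
    """Weight each source sample by P_t(x) / P_s(x), estimated per bin."""
    p_s = histogram(source_x, bins, lo, hi)
    p_t = histogram(target_x, bins, lo, hi)
    width = (hi - lo) / bins
    bin_of = lambda x: min(int((x - lo) / width), bins - 1)
    # Small floor on p_s avoids division by zero in empty source bins.
    return [p_t[bin_of(x)] / max(p_s[bin_of(x)], 1e-12) for x in source_x]

source = [0.1, 0.2, 0.15, 0.8]   # source samples cluster at low values
target = [0.7, 0.8, 0.9, 0.85]   # target samples cluster at high values
w = importance_weights(source, target)
# Source samples in regions the target distribution favours get larger
# weights; samples in regions the target never visits get weight 0.
```

Training on the source data with these weights makes the effective source distribution approximate the target distribution.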

    Fine-Tuning vs Feature Extraction

    • Feature extraction: freeze all pretrained layers and train only a new output head. Use when the target dataset is very small and the source and target are closely related.
    • Fine-tuning (full): unfreeze all layers and train end-to-end with a small learning rate. Use with a moderate target dataset, or when source and target differ meaningfully.
    • Gradual unfreezing: progressively unfreeze layers from top to bottom over training. Balances the stability of lower-level features with adaptation of higher-level ones.

    A common heuristic is to use a learning rate 10–100x smaller for pretrained layers than for the new classification head, preventing catastrophic forgetting of learned representations.
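The effect of this heuristic can be sketched with a toy single-step update in plain Python (the weights and gradients are made-up numbers; in practice one would configure per-group learning rates in an optimiser, e.g. via parameter groups):

```python
# Toy illustration of discriminative learning rates: one "backbone"
# weight (pretrained, to be preserved) and one "head" weight (new).
head_lr = 1e-2
backbone_lr = head_lr / 100   # 100x smaller for the pretrained layers

backbone_w = 0.5   # pretrained weight
head_w = 0.0       # freshly initialised output head

grad_backbone, grad_head = 1.0, 1.0   # same gradient magnitude for both

# One SGD step per parameter group.
backbone_w -= backbone_lr * grad_backbone
head_w -= head_lr * grad_head
# For the same gradient, the head moves 100x further than the backbone,
# so the new head adapts quickly while pretrained features barely drift.
```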

    Pretrained Models

    Computer Vision

    ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. Fine-tuning an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.

    Natural Language Processing

    Language model pretraining transformed NLP. Key milestones include:

    • Word2Vec / GloVe — static word embeddings pretrained on large corpora.
    • ELMo — contextualised embeddings from bidirectional LSTMs.
    • BERT (Devlin et al., 2019) — bidirectional Transformer pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.
    • GPT series — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.

    When to Use Transfer Learning

    Transfer learning is most beneficial when:

    1. The target dataset is small relative to the model's capacity.
    2. The source and target domains share structural similarities (e.g., both involve natural images or natural language).
    3. Computational resources for full pretraining are unavailable.
    4. Rapid prototyping is needed before committing to large-scale data collection.

    It may hurt performance (negative transfer) when the source and target domains are fundamentally dissimilar — for instance, transferring from natural images to spectrograms without appropriate adaptation.

    Practical Tips

    • Data augmentation complements transfer learning by artificially expanding the effective size of the target dataset.
    • Learning rate warmup helps stabilise early training when fine-tuning large pretrained models.
    • Early stopping on a validation set prevents overfitting during fine-tuning, especially with small datasets.
    • Layer-wise learning rate decay assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.
    • Intermediate task transfer — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.
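The layer-wise learning rate decay mentioned above is commonly implemented as a geometric schedule; a minimal sketch follows (the decay factor of 0.9 is a hypothetical choice, typical values lie around 0.9–0.95):

```python
def layerwise_lrs(base_lr, num_layers, decay=0.9):
    """Geometric layer-wise decay: layer i (0 = earliest, most general)
    receives base_lr * decay ** (num_layers - 1 - i), so earlier layers
    get smaller learning rates and the final layer gets the full rate."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_lrs(base_lr=1e-3, num_layers=4)
# Rates increase monotonically from the earliest to the final layer.
```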

    Evaluation

    The effectiveness of transfer learning is typically measured by comparing the accuracy of a transferred model with that of an identical model trained from scratch on the target task:

    $ \Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}} $

    A positive $ \Delta_{\mathrm{transfer}} $ indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.
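Both metrics are straightforward to compute; the accuracies and learning curves below are made-up illustrative numbers:

```python
def transfer_gain(acc_transfer, acc_scratch):
    """Delta_transfer = Acc_transfer - Acc_scratch (positive = success)."""
    return acc_transfer - acc_scratch

def epochs_to_target(acc_curve, target):
    """First epoch (1-indexed) at which accuracy reaches target, else None."""
    for epoch, acc in enumerate(acc_curve, start=1):
        if acc >= target:
            return epoch
    return None

delta = transfer_gain(acc_transfer=0.91, acc_scratch=0.84)

scratch_curve  = [0.50, 0.62, 0.71, 0.78, 0.84]  # per-epoch accuracy
transfer_curve = [0.80, 0.87, 0.91]
# Convergence speedup: ratio of epochs needed to reach 80% accuracy.
speedup = epochs_to_target(scratch_curve, 0.80) / epochs_to_target(transfer_curve, 0.80)
```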

    References

    • Pan, S. J. and Yang, Q. (2010). "A Survey on Transfer Learning". IEEE Transactions on Knowledge and Data Engineering.
    • Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?". NeurIPS.
    • Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
    • Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ACL.
    • Zhuang, F. et al. (2021). "A Comprehensive Survey on Transfer Learning". Proceedings of the IEEE.