Few-Shot Learning

    Topic area: Machine Learning
    Prerequisites: Supervised Learning, Transfer Learning, Meta-Learning


    Overview

    Few-shot learning is a paradigm in machine learning in which a model must generalize to a new task or class given only a small number of labeled examples, typically between one and a few dozen. It contrasts with the conventional supervised regime that assumes thousands or millions of labeled samples per class, and is motivated by both practical scarcity of labeled data and a desire to mimic the human ability to recognize new concepts after limited exposure. Few-shot learning sits at the intersection of Transfer Learning, Meta-Learning, and Representation Learning, and has become a central evaluation setting for Foundation Models and Large Language Models, where the term often refers specifically to in-context demonstrations supplied at inference time.

    A few-shot task is conventionally described as N-way K-shot: the learner must distinguish among N classes given K labeled examples per class, with K typically in the range of one to twenty. The K examples form the support set, and the unlabeled query examples to be classified form the query set. The aim is high query accuracy under tight sample budgets, and progress is measured against benchmarks designed to penalize models that simply overfit a fixed taxonomy.
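
    As a concrete illustration, the following minimal sketch samples one N-way K-shot episode from a pool of labeled examples. The function name and signature are illustrative, not taken from any library; `labels` is assumed to be an integer class label per example.

        import numpy as np

        def sample_episode(labels, n_way=5, k_shot=5, n_query=15, rng=None):
            # Pick N classes, then split each class's shuffled examples into
            # K support indices and up to n_query query indices.
            rng = rng or np.random.default_rng()
            classes = rng.choice(np.unique(labels), size=n_way, replace=False)
            support, query = [], []
            for cls in classes:
                idx = rng.permutation(np.flatnonzero(labels == cls))
                support.extend(idx[:k_shot])
                query.extend(idx[k_shot:k_shot + n_query])
            return np.array(support), np.array(query), classes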

    Problem Setting

    Formally, let $ \mathcal{D}_{\text{train}} $ denote a large meta-training distribution of tasks and $ \mathcal{D}_{\text{test}} $ a held-out distribution of novel tasks whose classes were unseen during training. Each task $ \tau $ samples a support set $ S = \{(x_i, y_i)\}_{i=1}^{N \cdot K} $ and a query set $ Q = \{(x_j^*, y_j^*)\} $. The objective is to learn a procedure $ f_\theta $ that, given $ S $, predicts $ y_j^* $ for each $ x_j^* $:

    $ {\displaystyle \theta^* = \arg\min_\theta \; \mathbb{E}_{\tau \sim \mathcal{D}_{\text{train}}} \; \mathbb{E}_{(S, Q) \sim \tau} \; \sum_{(x^*, y^*) \in Q} \mathcal{L}\bigl(f_\theta(x^*; S), y^*\bigr).} $

    Two boundary cases receive their own names. One-shot learning fixes $ K = 1 $, requiring a single labeled example per class. Zero-shot learning sets $ K = 0 $ and instead relies on auxiliary information such as class names, textual descriptions, or attribute vectors to identify novel classes. Generalized few-shot learning evaluates on a mixture of base and novel classes, exposing the catastrophic forgetting that often accompanies aggressive adaptation.

    Methodological Families

    Few-shot methods are typically grouped into four families, each making a different bet about where useful inductive bias lives.

    Metric-based methods learn an embedding space in which simple geometric rules suffice for classification. Matching Networks compute attention weights between query and support embeddings.[1] Prototypical Networks[2] compute a prototype $ c_n = \tfrac{1}{K}\sum_{(x,y) \in S, y=n} f_\phi(x) $ per class and assign queries by nearest prototype under squared Euclidean or cosine distance:

    $ {\displaystyle p(y = n \mid x) = \frac{\exp(-d(f_\phi(x), c_n))}{\sum_{n'} \exp(-d(f_\phi(x), c_{n'}))}.} $

    Relation Networks replace the fixed distance function with a learned comparator. The shared property is that no per-task gradient updates occur at test time.
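
    A minimal PyTorch sketch of the prototype rule above, assuming an arbitrary embedding network $ f_\phi $; the encoder, sizes, and episode data below are stand-ins.

        import torch

        def prototypical_logits(f_phi, support_x, support_y, query_x, n_way):
            # Embed support and query, average the K support embeddings per
            # class into prototypes, then score queries by negative squared
            # Euclidean distance; a softmax over these logits gives p(y=n|x).
            z_s, z_q = f_phi(support_x), f_phi(query_x)
            prototypes = torch.stack(
                [z_s[support_y == n].mean(dim=0) for n in range(n_way)])
            return -torch.cdist(z_q, prototypes) ** 2

        # Usage on a toy 5-way 5-shot episode with a stand-in linear encoder.
        f_phi = torch.nn.Linear(784, 64)
        sx = torch.randn(25, 784)
        sy = torch.arange(5).repeat_interleave(5)
        qx = torch.randn(15, 784)
        pred = prototypical_logits(f_phi, sx, sy, qx, n_way=5).argmax(dim=1)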

    Optimization-based methods adapt model parameters to each task with a small number of gradient steps. Model-Agnostic Meta-Learning (MAML)[3] learns an initialization $ \theta $ such that one or a few Gradient Descent updates on the support set produce strong task-specific parameters, with the meta-loss evaluated on the query set:

    $ {\displaystyle \theta^* = \arg\min_\theta \sum_\tau \mathcal{L}_\tau\bigl(\theta - \alpha \nabla_\theta \mathcal{L}_\tau(\theta; S);\; Q\bigr).} $

    Reptile, first-order MAML, and implicit MAML reduce the cost of differentiating through the inner loop, and methods such as ANIL show that adaptation often needs to touch only the final classifier.
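
    A minimal PyTorch sketch of this inner-outer structure for a linear model; the data and sizes below are random stand-ins. Passing create_graph=True keeps the inner update differentiable, so the meta-gradient includes the second-order terms that first-order variants drop.

        import torch
        import torch.nn.functional as F

        def maml_outer_loss(w, sx, sy, qx, qy, inner_lr=0.1):
            # Inner loop: one gradient step on the support set.
            (g,) = torch.autograd.grad(
                F.cross_entropy(sx @ w, sy), w, create_graph=True)
            w_adapted = w - inner_lr * g            # differentiable update
            # Outer loss: adapted parameters evaluated on the query set.
            return F.cross_entropy(qx @ w_adapted, qy)

        w = torch.zeros(64, 5, requires_grad=True)  # meta-learned initialization
        meta_opt = torch.optim.Adam([w], lr=1e-3)
        for _ in range(100):                        # tasks with stand-in data
            sx, sy = torch.randn(25, 64), torch.randint(0, 5, (25,))
            qx, qy = torch.randn(75, 64), torch.randint(0, 5, (75,))
            meta_opt.zero_grad()
            maml_outer_loss(w, sx, sy, qx, qy).backward()
            meta_opt.step()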

    Model-based methods design architectures whose forward pass implements rapid adaptation. Memory-augmented networks store support examples in an external slot memory, and conditional neural processes treat the support set as input to a permutation-invariant encoder that conditions a predictive distribution.
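
    A toy sketch of the conditional-neural-process idea: the support set is reduced to an order-invariant mean embedding that conditions each query prediction. Layer sizes and architecture choices here are illustrative only.

        import torch

        class TinyCNP(torch.nn.Module):
            def __init__(self, x_dim=1, y_dim=1, hidden=64):
                super().__init__()
                self.encoder = torch.nn.Sequential(
                    torch.nn.Linear(x_dim + y_dim, hidden), torch.nn.ReLU(),
                    torch.nn.Linear(hidden, hidden))
                self.decoder = torch.nn.Sequential(
                    torch.nn.Linear(hidden + x_dim, hidden), torch.nn.ReLU(),
                    torch.nn.Linear(hidden, y_dim))

            def forward(self, support_x, support_y, query_x):
                # Mean over support embeddings: a permutation-invariant
                # task summary that conditions every query prediction.
                r = self.encoder(torch.cat([support_x, support_y], -1)).mean(0)
                r = r.expand(query_x.shape[0], -1)
                return self.decoder(torch.cat([r, query_x], -1))

        model = TinyCNP()
        sx, sy = torch.randn(5, 1), torch.randn(5, 1)  # 5 support points
        preds = model(sx, sy, torch.randn(10, 1))      # 10 query predictions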

    In-context learning has emerged as the dominant few-shot interface for Large Language Models. The model is shown labeled examples in its prompt and produces predictions for a new input without any parameter updates.[4] Performance depends sharply on demonstration ordering, label distribution, and prompt formatting, and recent work shows that the dominant signal is often the input distribution and label space rather than the input-label mapping itself.
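
    In code, the "support set" is simply text in the prompt. The sentiment template below and the llm_complete call are illustrative stand-ins, not any specific model's API.

        demos = [("The food was amazing.", "positive"),
                 ("Service was painfully slow.", "negative")]

        def build_prompt(demos, query):
            # Each demonstration becomes an input-label pair; the model is
            # asked to continue the same pattern for the new input.
            lines = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
            lines.append(f"Review: {query}\nSentiment:")
            return "\n\n".join(lines)

        prompt = build_prompt(demos, "Great value for the price.")
        # prediction = llm_complete(prompt)  # stand-in for any completion API;
        #                                    # no parameter updates occur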

    Training and Evaluation

    Episodic training, popularized by Matching Networks, samples a fresh N-way K-shot task at every training step so that the meta-loss matches the test objective. An alternative shown to be competitive on standard image benchmarks is to pre-train a strong feature extractor on the full meta-training set with ordinary classification, then attach a simple Logistic Regression or nearest-centroid head at test time. This baseline matches or exceeds many episodic methods and has reframed parts of the field around representation quality rather than meta-objectives.
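
    A sketch of that baseline, assuming some frozen pre-trained extract_features function (a stand-in for any encoder):

        from sklearn.linear_model import LogisticRegression

        def linear_probe_predict(extract_features, sx, sy, qx):
            # Fit a logistic-regression head on the N*K support features,
            # then classify query features; no episodic meta-training needed.
            clf = LogisticRegression(max_iter=1000)
            clf.fit(extract_features(sx), sy)
            return clf.predict(extract_features(qx))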

    Standard image benchmarks include miniImageNet, tieredImageNet, CIFAR-FS, FC100, and Meta-Dataset, the last of which deliberately samples tasks with varying numbers of ways, shots, and image domains to expose brittleness. NLP benchmarks include CrossFit, FLEX, and the few-shot tracks of GLUE and SuperGLUE. Reported numbers usually average over hundreds or thousands of sampled episodes with 95 percent confidence intervals, since single-episode variance is large.
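
    The aggregation itself is straightforward: a normal-approximation 95 percent interval over per-episode accuracies, as in this sketch.

        import numpy as np

        def episode_ci(accuracies):
            # Mean accuracy +/- 1.96 standard errors over sampled episodes.
            acc = np.asarray(accuracies)
            return acc.mean(), 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))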

    Comparison with Related Paradigms

    Few-shot learning is often confused with adjacent regimes that solve different problems.

    Transfer Learning reuses a pre-trained model on a downstream task with abundant labeled data, while few-shot learning explicitly assumes the downstream budget is tiny. Meta-Learning is a strict superset that includes few-shot learning along with rapid Reinforcement Learning adaptation and continual learning. Semi-Supervised Learning assumes many unlabeled examples alongside few labeled ones; the budget asymmetry is opposite to the few-shot setting where unlabeled data may also be scarce. Self-supervised pre-training can be combined with few-shot adaptation, and in fact most strong few-shot systems begin with a self-supervised or large-scale supervised initialization.

    Limitations

    Several recurring failure modes complicate few-shot deployment. Performance is sensitive to the gap between meta-training and meta-test distributions, and methods that excel on miniImageNet often degrade on Meta-Dataset where domain shift is explicit. Confidence calibration tends to be poor with very small support sets, and selective prediction is unreliable. In-context learning in Large Language Models is sensitive to the choice and ordering of demonstrations, with reordering alone capable of moving accuracy by tens of percentage points. Generalized few-shot evaluation reveals that aggressive adaptation often degrades performance on base classes, a form of catastrophic forgetting. Finally, claims of strong few-shot performance can be artifacts of leakage between meta-train and meta-test class hierarchies, motivating benchmarks like Meta-Dataset that enforce explicit domain splits.

    References

    1. Vinyals et al., Matching Networks for One Shot Learning, 2016.
    2. Snell et al., Prototypical Networks for Few-shot Learning, 2017.
    3. Finn et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, 2017.
    4. Brown et al., Language Models are Few-Shot Learners, 2020.