In-Context Learning
| Article | |
|---|---|
| Topic area | Large Language Models |
| Prerequisites | Transformer, Large Language Model, Attention Mechanism |
Overview
In-context learning (ICL) is the ability of a pretrained Large Language Model to perform a new task at inference time by conditioning on a small number of input-output examples placed inside its prompt, without any update to its parameters. The model reads the examples as ordinary tokens and produces a completion that solves a held-out query in the same format. ICL was popularized by the GPT-3 paper, which showed that scaling autoregressive transformers to hundreds of billions of parameters caused this capability to emerge sharply, turning a single frozen model into a general-purpose few-shot learner.[1] Because no gradient flows during ICL, it is operationally distinct from Fine-tuning and from classical meta-learning: the only adaptation channel is the forward pass.
ICL matters because it changes how downstream systems are built. Instead of training a model per task, practitioners ship one base model and steer its behavior by writing prompts. This shift is the foundation of modern prompt engineering, retrieval-augmented generation, and tool-using agents, and it underlies most production deployments of frontier LLMs.
Mechanism and Intuition
A typical ICL prompt has three parts: an optional natural-language instruction, a sequence of $ k $ demonstrations of the form (input, output), and a final query input whose output the model must produce. The standard naming follows the number of demonstrations: $ k=0 $ is zero-shot, $ k=1 $ is one-shot, and $ k>1 $ is few-shot. The model never sees a separate "training" signal; it simply continues the most likely sequence given the prompt.
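As a concrete illustration, a minimal sketch of assembling such a prompt follows; the sentiment task, the Input/Output delimiters, and the examples are illustrative assumptions, not a canonical format.

```python
# Minimal sketch: instruction + k demonstrations + query, joined as plain text.
# Task, delimiters, and examples are illustrative assumptions.
def build_prompt(instruction, demos, query):
    parts = [instruction] if instruction else []
    for x, y in demos:                        # the k demonstrations
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")  # query; completion is left to the model
    return "\n\n".join(parts)

prompt = build_prompt(
    "Classify the sentiment of each review.",
    [("loved it", "positive"), ("terrible plot", "negative")],  # k = 2: few-shot
    "surprisingly fun",
)
```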
Intuitively, the demonstrations serve two roles simultaneously. They specify the task by showing the input distribution, output space, and label format, and they provide concrete patterns the model can pattern-match against. Empirically, surprisingly little of the gain comes from the input-label pairings being correct: replacing gold labels with random ones often degrades accuracy only modestly, while corrupting the input distribution or label space hurts severely.[2] This suggests ICL primarily activates capabilities already latent in the pretrained model rather than learning new ones from the prompt.
Formulation
Let $ p_\theta $ denote a frozen autoregressive language model with parameters $ \theta $. Given a prompt context
$ C_k = (x_1, y_1, x_2, y_2, \ldots, x_k, y_k, x_{\text{query}}) $
the model predicts the answer as
$ \hat{y} = \arg\max_y p_\theta(y \mid C_k). $
The conditional distribution $ p_\theta(\cdot \mid C_k) $ is computed entirely by running the Transformer forward pass over the concatenated token sequence; $ \theta $ is unchanged. The "learning" appears purely as a function of the context and the model's pretrained weights.
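A minimal sketch of this arg-max for a classification-style task is below, assuming a Hugging Face causal LM and a tokenizer that adds no special tokens (true of GPT-2); each candidate answer is scored by its summed token log-probability under the frozen model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_logprob(context: str, answer: str) -> float:
    """Sum log p_theta(answer tokens | context, earlier answer tokens)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ans_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t + 1, so slice out the answer span.
    logprobs = logits[:, ctx_ids.shape[1] - 1 : -1].log_softmax(-1)
    return logprobs.gather(-1, ans_ids.unsqueeze(-1)).sum().item()

context = ("Review: terrible plot\nSentiment: negative\n"
           "Review: loved it\nSentiment: positive\n"
           "Review: surprisingly fun\nSentiment:")
y_hat = max([" positive", " negative"], key=lambda y: answer_logprob(context, y))
```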
A useful framing comes from viewing ICL as implicit Bayesian inference: the model behaves as if it has a prior over latent tasks induced by pretraining, and the demonstrations sharpen its posterior over which task is being requested.[3] Under this view, demonstrations work even when the model never explicitly stored them; they update the posterior, not the parameters.
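A toy numeric version of this framing follows; the two candidate "tasks", their labeling rules, and the noise level are all assumptions for illustration. A uniform prior over latent tasks sharpens into a posterior as each demonstration arrives, with no parameter ever updated.

```python
# Toy implicit-Bayesian view: posterior over two assumed latent labeling tasks.
def p_label(x, y, task, eps=0.1):
    """Likelihood of label y for word x under a latent task, with eps label noise."""
    rules = {
        "sentiment": lambda w: "pos" if w in {"fun", "great"} else "neg",
        "word_length": lambda w: "pos" if len(w) > 4 else "neg",
    }
    return 1 - eps if rules[task](x) == y else eps

demos = [("fun", "pos"), ("boring", "neg")]         # consistent only with "sentiment"
posterior = {"sentiment": 0.5, "word_length": 0.5}  # uniform prior over tasks

for x, y in demos:
    posterior = {t: p * p_label(x, y, t) for t, p in posterior.items()}
    z = sum(posterior.values())
    posterior = {t: p / z for t, p in posterior.items()}
    print(f"after ({x!r}, {y!r}): {posterior}")
# The posterior concentrates on "sentiment" (~0.99 after two demonstrations).
```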
Inference Behavior
ICL has several distinctive inference-time characteristics that set it apart from other adaptation methods.
- Stateless: each new query is its own forward pass; there is no persistent change to the model. Two users in parallel can use the same model with completely different ICL prompts.
- Compute scales with context length: because self-attention is quadratic in sequence length, doubling the prompt roughly quadruples prefill attention cost, while per-token decoding cost grows linearly with context. KV caching and shared-prefix reuse mitigate the cost when the same demonstrations are repeated across queries.
- Order-sensitive: performance can swing significantly with the order of demonstrations, especially when the model has recency bias. Selection and ordering strategies (e.g. retrieving demonstrations similar to the query) are an active area of practice; see the sketch after this list.
- Format-sensitive: delimiter choice, label verbalization, and whitespace can all shift accuracy by several points. ICL is brittle in ways that fine-tuning typically is not.
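A minimal sketch of retrieval-based selection and ordering, as referenced above: `embed` stands in for any sentence-embedding model, and placing the most similar demonstration last is one common heuristic for recency-biased models, not a universal rule.

```python
import numpy as np

def select_demos(query, pool, embed, k=4):
    """Return the k (input, output) pairs whose inputs are most similar
    to the query, ordered so the most similar demonstration comes last."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    vecs = np.stack([embed(x) for x, _ in pool])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    top = np.argsort(vecs @ q)[-k:]   # ascending similarity; most similar is last
    return [pool[i] for i in top]
```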
Variants
A number of common variants extend the basic recipe.
- Zero-shot prompting: no demonstrations, only an instruction. Modern instruction-tuned models often perform nearly as well zero-shot as few-shot, because instruction tuning has internalized the format.
- Few-shot prompting: the canonical ICL setup with $ k $ demonstrations. Usually $ k $ is small (4-32) due to context limits and diminishing returns.
- Chain-of-thought (CoT): demonstrations include intermediate reasoning steps, not just the final answer. CoT dramatically improves multi-step arithmetic, symbolic, and commonsense reasoning, and is now a default for analytical tasks (see the sketch after this list).[4]
- Retrieval-augmented ICL: demonstrations are selected dynamically per query from a larger corpus, often using embedding similarity. This couples ICL with Retrieval-Augmented Generation.
- Many-shot ICL: with million-token context windows, prompts can carry hundreds or thousands of demonstrations, sometimes approaching fine-tuning quality on narrow tasks.
- Code and tool ICL: demonstrations show how to call tools, write code, or follow a structured output schema; the model learns the protocol from the examples.
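For example, a chain-of-thought demonstration (referenced in the CoT entry above) embeds the reasoning trace in the output; the arithmetic problem and rationale text here are illustrative assumptions.

```python
# Minimal sketch of a CoT-formatted demonstration plus query.
cot_demo = (
    "Q: A pack has 12 pens. Dana buys 3 packs and gives away 5 pens. "
    "How many pens does she have?\n"
    "A: 3 packs hold 3 * 12 = 36 pens. After giving away 5, she has "
    "36 - 5 = 31 pens. The answer is 31.\n\n"
)
query = "Q: A tray holds 8 cups. Two trays minus 3 cups is how many cups?\nA:"
prompt = cot_demo + query  # the model is expected to continue with reasoning steps
```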
Theoretical Perspectives
Why a frozen Transformer performs anything resembling learning in its forward pass is an active research question, and several complementary explanations have emerged.
The induction-head account identifies specific attention-head circuits that, after sufficient pretraining, implement a "look back, then copy with shift" pattern. These heads can complete $ [A][B] \ldots [A] \to [B] $ sequences and are strongly correlated with the emergence of ICL during training.[5]
The implicit-optimizer account argues that under certain assumptions, a transformer's forward pass on ICL inputs can implement gradient descent on an implicit linear or kernel regression objective defined by the demonstrations. Constructive proofs show transformers can simulate one or more gradient steps in a single forward pass, providing a mechanistic story for in-context regression.[6]
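A minimal numpy illustration of the simplest case of this account follows (an assumed toy, not the paper's full construction): one gradient-descent step from $ w = 0 $ on in-context least squares yields exactly the same query prediction as unnormalized linear self-attention over the demonstrations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 16                      # feature dimension, demonstration count
X = rng.normal(size=(k, d))       # in-context inputs x_1..x_k
y = X @ rng.normal(size=d)        # in-context targets from a hidden w*
x_q = rng.normal(size=d)          # query input
eta = 0.1                         # step size of the implicit update

# Explicit: one GD step on (1/2) * sum_i (w @ x_i - y_i)^2, starting at w = 0.
# The gradient at w = 0 is -X.T @ y, so one step gives w1 = eta * X.T @ y.
w1 = eta * X.T @ y
pred_gd = w1 @ x_q

# Implicit: linear attention with keys = x_i, values = y_i, query = x_q
pred_attn = eta * float((X @ x_q) @ y)

assert np.isclose(pred_gd, pred_attn)  # identical by construction
```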
The Bayesian account views ICL as posterior inference over latent tasks encoded in the pretraining distribution, recovering meta-learning-like behavior without an explicit meta-training loop. These three views are not mutually exclusive: induction heads can implement the lookups that an implicit gradient step or Bayesian update needs.
Comparison with Fine-tuning and Meta-learning
ICL trades parameter updates for prompt design. Compared with Fine-tuning or Parameter-Efficient Fine-tuning (e.g. LoRA), ICL has near-zero setup cost, no data-pipeline requirements, and produces no per-task artifact, but typically caps out at lower accuracy on data-rich tasks and pays a higher per-inference compute cost. Fine-tuning amortizes adaptation into weights once; ICL pays for it on every call.
Compared with classical meta-learning (e.g. MAML), ICL is meta-learning that emerged for free: the meta-training objective is just standard next-token prediction on a diverse corpus, and the inner loop is a forward pass rather than an explicit gradient step. ICL is therefore much cheaper to train but harder to control or analyze.
Hybrid approaches are common in practice: an instruction-tuned model is fine-tuned on broad task formats, then steered per-task via ICL. Retrieval pipelines combine ICL with a dynamic example store, blurring the line with non-parametric learning.
Limitations
ICL inherits the limitations of the underlying model and adds a few of its own. Performance plateaus or even degrades beyond a task-specific demonstration count; long contexts incur quadratic attention cost; and behavior is sensitive to demonstration choice, order, and formatting. ICL cannot reliably teach the model genuinely new factual content not represented in pretraining, since no parameters change; demonstrations can only steer existing capabilities. Robustness studies show that ICL accuracy often collapses into confident hallucination when the task drifts even slightly from the demonstration distribution.
There are also safety-relevant failure modes. Adversarial demonstrations can be used to elicit unwanted behavior (prompt injection), and the same flexibility that makes ICL useful makes it hard to bound. Production systems usually combine ICL with output validation, retrieval grounding, and preference-tuned base models to limit these risks.
References
- ↑ Brown et al., "Language Models are Few-Shot Learners," 2020. arXiv:2005.14165.
- ↑ Min et al., "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?," 2022. arXiv:2202.12837.
- ↑ Xie et al., "An Explanation of In-context Learning as Implicit Bayesian Inference," 2021. arXiv:2111.02080.
- ↑ Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," 2022. arXiv:2201.11903.
- ↑ Olsson et al., "In-context Learning and Induction Heads," 2022. arXiv:2209.11895.
- ↑ von Oswald et al., "Transformers Learn In-Context by Gradient Descent," 2022. arXiv:2212.07677.