Chain-of-Thought Prompting
| Article | |
|---|---|
| Topic area | Natural Language Processing |
| Prerequisites | Transformer, Language Models are Few-Shot Learners |
Overview
Chain-of-thought prompting (CoT) is a prompting technique that elicits multi-step reasoning from large language models by instructing the model, or demonstrating through worked examples, that it should produce intermediate natural-language steps before emitting a final answer. Introduced by Wei et al. in 2022,[1] chain-of-thought prompting substantially improves performance on tasks that require arithmetic, commonsense, or symbolic reasoning, and it does so without changing model weights. The technique is now a standard component of prompt engineering for tasks where the answer depends on a sequence of inferences rather than a single pattern match.
The core observation is that sufficiently large language models can solve problems they otherwise miss when the prompt encourages them to externalize the reasoning process as text. The intermediate steps act as a scratchpad: each step conditions the next, and the final answer is conditioned on the full trace. Chain-of-thought is closely related to the broader idea of test-time computation, where additional inference-time tokens are spent in exchange for higher accuracy.
Motivation
Standard few-shot prompting, popularized by GPT-3,[2] presents a model with input-output pairs and asks it to complete a new input. This works well for tasks that map an input to an output through a single learned association, but it underperforms on multi-step problems such as multi-digit arithmetic, word problems, and logical deduction. Wei et al. showed that for these tasks the bottleneck was not the model's knowledge but its allocation of computation: forced to emit the answer within its first few generated tokens, the model had no room to decompose the problem.
By contrast, when each in-context example shows a worked solution rather than just an answer, the model imitates the format and produces its own worked solution at inference time. This reframes prompting as supplying not only the task but also a procedure for solving it.
Formulation
Let $ x $ denote the input question and $ y $ the final answer. A standard few-shot prompt models $ p(y \mid x, \mathcal{D}) $, where $ \mathcal{D} = \{(x_i, y_i)\}_{i=1}^k $ is a set of demonstrations. Chain-of-thought introduces an intermediate variable $ r $, the reasoning trace, and the prompt instead supplies $ \mathcal{D}_{\mathrm{CoT}} = \{(x_i, r_i, y_i)\} $. The model decomposes the joint distribution as
$ {\displaystyle p(r, y \mid x, \mathcal{D}_{\mathrm{CoT}}) = p(r \mid x, \mathcal{D}_{\mathrm{CoT}})\, p(y \mid x, r, \mathcal{D}_{\mathrm{CoT}}).} $
At inference time the model is decoded autoregressively: it first emits $ r $ token by token and then emits $ y $ conditioned on the full trace. Because each token in $ r $ participates in subsequent attention computations, the trace effectively expands the depth of computation available to the model beyond what a fixed-size single-token answer would allow.
A common design choice is to mark the final answer with a fixed delimiter such as "The answer is" so that $ y $ can be extracted programmatically from the generated text.
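The delimiter convention above makes extraction a simple string operation. A minimal sketch (the helper name and regex are illustrative assumptions, not from the original paper):

```python
import re

def extract_answer(trace: str):
    """Return the text following the last 'The answer is' delimiter, if any.

    Taking the last occurrence guards against the phrase also appearing
    inside a copied exemplar earlier in the generated text.
    """
    matches = re.findall(r"The answer is\s*([^\n.]+)", trace)
    return matches[-1].strip() if matches else None

trace = "Roger started with 5 balls. 5 + 6 = 11. The answer is 11."
print(extract_answer(trace))  # 11
```

In practice the delimiter and parsing rule should match whatever format the in-context exemplars establish, since the model imitates that format.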
Few-Shot CoT
In few-shot chain-of-thought, each demonstration in the prompt consists of a question, a worked solution, and a final answer. A canonical example from the original paper[3] is the math word problem template:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6
tennis balls. 5 + 6 = 11. The answer is 11.
Typically four to eight such exemplars suffice to elicit chain-of-thought behavior on a new question. The exemplars need not be drawn from the test distribution; transfer across reasoning domains is common, although in-domain exemplars usually help.
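Assembling such a prompt is plain string concatenation. A minimal sketch using the Wei et al. exemplar quoted above (the helper name and Q/A layout are illustrative assumptions):

```python
# Each exemplar is a (question, worked solution ending in "The answer is ...") pair.
EXEMPLARS = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
        "tennis balls. 5 + 6 = 11. The answer is 11.",
    ),
]

def build_cot_prompt(question: str, exemplars=EXEMPLARS) -> str:
    """Concatenate worked exemplars, then the new question with an open 'A:'."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")  # the model completes from here
    return "\n\n".join(parts)

print(build_cot_prompt("A pen costs 2 dollars. How much do 4 pens cost?"))
```

The trailing open "A:" is what invites the model to continue in the worked-solution format rather than answering directly.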
Zero-Shot CoT
Kojima et al.[4] showed that the demonstrations are not strictly necessary. Appending the trigger phrase "Let's think step by step" to the question is often sufficient to produce a reasoning trace, an approach known as zero-shot CoT. A second prompt then extracts the final answer from the trace. Zero-shot CoT is weaker than few-shot CoT on most benchmarks but eliminates the labor of writing exemplars and avoids the risk that an exemplar leaks information about the answer.
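The two-call structure can be sketched as follows; the exact second-stage extraction phrasing varies by implementation, so the wording here is an assumption rather than Kojima et al.'s verbatim prompt:

```python
TRIGGER = "Let's think step by step."

def stage_one_prompt(question: str) -> str:
    """First call: elicit the reasoning trace with the trigger phrase."""
    return f"Q: {question}\nA: {TRIGGER}"

def stage_two_prompt(question: str, trace: str) -> str:
    """Second call: feed the trace back and ask for the final answer.

    The extraction phrasing below is illustrative; Kojima et al. use a
    similar fixed suffix so the answer can be parsed from the completion.
    """
    return f"Q: {question}\nA: {TRIGGER} {trace}\nTherefore, the answer is"
```

The model's completion of the second prompt is then parsed as the final answer $ y $.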
Self-Consistency
Wang et al.[5] observed that reasoning traces are noisy: a model may arrive at the correct final answer through several distinct argument paths, and an incorrect path is often idiosyncratic. Self-consistency samples $ N $ independent traces with non-zero temperature and returns the answer that occurs most frequently across the resulting $ \{y^{(1)}, \dots, y^{(N)}\} $. Formally, the prediction is
$ {\displaystyle \hat{y} = \arg\max_{y} \sum_{i=1}^{N} \mathbb{1}[y^{(i)} = y].} $
Self-consistency improves accuracy roughly monotonically in $ N $ on most reasoning benchmarks, at the cost of $ N $-fold inference compute. It is now a standard wrapper around any chain-of-thought decoder.
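Given the $ N $ extracted answers, the vote itself is a one-liner; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers extracted from N sampled traces.

    Ties are broken by first occurrence, which is Counter's default order.
    """
    return Counter(answers).most_common(1)[0][0]

# Five sampled traces yielded these final answers; "11" wins the vote.
print(self_consistency(["11", "11", "12", "11", "9"]))  # 11
```

Note that the vote is taken over the extracted answers $ y^{(i)} $, not over the traces: two traces with different wording but the same final answer count as agreement.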
Tree and Graph Variants
Several extensions structure the reasoning beyond a single linear trace. Tree of Thoughts (Yao et al., 2023)[6] explores multiple partial traces, scores them with the model itself, and uses search algorithms such as breadth-first or best-first to expand the most promising branches. Graph of Thoughts generalizes this to a directed graph in which intermediate states can be merged, refined, or aggregated.
These variants shift more of the inference loop outside the model into a controller, and they trade additional compute for greater robustness on tasks where a single trace easily goes astray.
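The controller loop can be sketched generically. In a real Tree-of-Thoughts system both `propose` and `score` would be calls to the language model; here they are toy stand-ins (assumptions) so the best-first search itself is runnable:

```python
import heapq
import itertools

def tree_of_thoughts(root, propose, score, is_goal, beam=3, max_steps=10):
    """Best-first search over partial reasoning states, in the spirit of
    Tree of Thoughts. `propose(state)` expands a state into children and
    `score(state)` rates it (higher is better); both would be model calls
    in practice. Keeps at most `beam` candidates after each expansion.
    """
    counter = itertools.count()  # tie-breaker so states need not be comparable
    frontier = [(-score(root), next(counter), root)]  # max-heap via negation
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)  # most promising state first
        if is_goal(state):
            return state
        for child in propose(state):
            heapq.heappush(frontier, (-score(child), next(counter), child))
        frontier = heapq.nsmallest(beam, frontier)  # prune to beam width
        heapq.heapify(frontier)
    return None

# Toy demo: reach 7 from 0 using +1 / +3 steps, scoring states by closeness.
result = tree_of_thoughts(
    root=0,
    propose=lambda s: [s + 1, s + 3],
    score=lambda s: -abs(7 - s),
    is_goal=lambda s: s == 7,
)
print(result)  # 7
```

Swapping the pop-and-prune policy changes the search regime: popping all states at one depth before scoring gives breadth-first search, while the heap above gives best-first.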
Comparison to Alternatives
Chain-of-thought is one of several approaches to multi-step reasoning. Scratchpad methods, introduced for arithmetic, train models to emit intermediate steps via supervised fine-tuning rather than prompting. Program-aided language models (PAL) and similar tool-using methods offload arithmetic and logic to an external interpreter, leaving only the high-level decomposition to the model. Process reward models supervise individual reasoning steps during reinforcement learning rather than only the final answer. Modern reasoning-trained models, such as those produced by reinforcement learning from process feedback, can be viewed as having internalized chain-of-thought into their default decoding behavior.
Compared with these alternatives, prompted chain-of-thought has the appeal of requiring no training and being applicable to any sufficiently capable model. Its disadvantages are higher inference cost, sensitivity to exemplar choice, and the fact that the surfaced trace may not faithfully reflect the computation that produced the answer.
Emergence and Scale
Chain-of-thought benefits depend strongly on model scale. Wei et al. reported that on the GSM8K math word-problem benchmark CoT is essentially neutral or harmful below roughly 10 billion parameters and only begins to dominate the standard prompt at around 60 billion parameters. This pattern, in which a capability appears abruptly with scale, is one of the canonical examples cited in discussions of emergent abilities of large language models. The scale threshold varies with the task, the base model family, and the metric.
The interaction with instruction tuning and reinforcement learning from human feedback is also important: instruction-tuned models often produce reasoning traces by default, blurring the distinction between prompted and unprompted chain-of-thought.
Limitations
Chain-of-thought prompting has several well-documented failure modes. First, the reasoning trace is not guaranteed to be faithful: the model may emit a plausible-sounding argument that does not in fact determine the final answer it gives, a phenomenon studied under the heading of reasoning faithfulness. Second, errors compound: if an early step is wrong, the rest of the trace and the final answer typically inherit the error, and the resulting confident-sounding output can be more misleading than a direct wrong answer. Third, inference cost grows with trace length, which matters for latency-sensitive deployments. Finally, sensitivity to exemplar phrasing and ordering means that small perturbations to the prompt can produce large swings in accuracy, making prompt selection itself a tuning problem.
Despite these caveats, chain-of-thought and its descendants are now the default approach for prompting language models on tasks that require more than a single step of inference, and they form the conceptual basis for the explicit-reasoning training regimes used by current frontier models.
References
- ↑ Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903.
- ↑ Brown, T., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." arXiv:2005.14165.
- ↑ Wei et al., 2022, Section 3.
- ↑ Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., Iwasawa, Y. (2022). "Large Language Models are Zero-Shot Reasoners." arXiv:2205.11916.
- ↑ Wang, X., Wei, J., Schuurmans, D., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171.
- ↑ Yao, S., Yu, D., Zhao, J., et al. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv:2305.10601.