Language Models are Few-Shot Learners
Revision as of 00:31, 27 April 2026
| Research Paper | |
|---|---|
| Authors | Tom B. Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M. Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei |
| Year | 2020 |
| Venue | NeurIPS |
| Topic area | NLP |
| Difficulty | Research |
| arXiv | 2005.14165 |
Language Models are Few-Shot Learners is a 2020 paper by Brown et al. from OpenAI that introduced GPT-3, a 175-billion-parameter autoregressive language model. The paper demonstrated that sufficiently large language models can perform a wide variety of NLP tasks through in-context learning — simply by conditioning on a few examples provided in the prompt — without any gradient updates or fine-tuning.
Overview
The dominant paradigm in NLP at the time involved pre-training a model on large corpora and then fine-tuning it on task-specific labeled datasets. While effective, this approach required curated datasets for every new task, encouraged models to exploit spurious correlations in narrow training distributions, and did not match how humans learn tasks from minimal instruction.
GPT-3 explored an alternative: scaling up an autoregressive language model to unprecedented size and evaluating it in zero-shot, one-shot, and few-shot settings, where the model receives only a natural language description and possibly a few examples of the task within the input prompt. The results showed that scale alone could unlock emergent few-shot learning abilities competitive with or exceeding fine-tuned models on many benchmarks.
Key Contributions
- GPT-3: A 175-billion-parameter autoregressive Transformer language model, over 100 times larger than GPT-2, trained on a diverse corpus of internet text.
- In-context learning: Demonstration that large language models can learn tasks from examples presented in the prompt without parameter updates.
- Scaling laws for few-shot performance: Evidence that few-shot performance scales smoothly with model size across three orders of magnitude (125M to 175B parameters).
- Analysis of the social impacts and potential misuse of large language models, including bias, fairness, and energy consumption.
Methods
GPT-3 uses the same architecture as GPT-2 — a decoder-only Transformer with pre-normalization — but scaled to 175 billion parameters across 96 layers, with a hidden size of 12,288 and 96 attention heads. The layers alternate between dense and locally banded sparse attention patterns, similar to the Sparse Transformer.
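The dimensions above roughly account for the quoted parameter count. A back-of-the-envelope sketch (the 12·d_model² per-block approximation and the vocabulary size are my assumptions, not figures taken from the paper):

```python
# Rough parameter-count estimate for a GPT-style decoder-only Transformer.
# Each block contributes ~12 * d_model^2 parameters: 4*d_model^2 for the
# Q, K, V, and output projections, plus 8*d_model^2 for the two MLP layers
# with a 4x expansion. Biases and layer norms are ignored.

def approx_transformer_params(n_layers: int, d_model: int,
                              vocab_size: int = 50_257) -> int:
    """Approximate parameter count for a decoder-only Transformer."""
    per_block = 12 * d_model ** 2       # attention + MLP weights per layer
    embeddings = vocab_size * d_model   # token embedding matrix
    return n_layers * per_block + embeddings

total = approx_transformer_params(n_layers=96, d_model=12288)
print(f"{total / 1e9:.1f}B parameters")  # ~174.6B, close to the quoted 175B
```

The estimate lands within one percent of 175B, which suggests the headline number is essentially the attention and MLP weight matrices at these dimensions.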
The model was trained on a filtered and deduplicated dataset of approximately 570 GB of text, drawn primarily from Common Crawl (filtered for quality using a classifier trained on high-quality reference corpora), supplemented with WebText2, Books1, Books2, and English Wikipedia. Training used a batch size that ramped from 32K to 3.2M tokens, and a learning rate schedule with linear warmup followed by cosine decay.
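The batch-size ramp can be sketched as a simple schedule function. This is an illustrative linear ramp under assumed endpoints from the text; the paper's exact ramp duration and shape are not specified here, so `ramp_steps` is a placeholder:

```python
def batch_size_tokens(step: int, ramp_steps: int = 10_000,
                      start: int = 32_000, end: int = 3_200_000) -> int:
    """Linearly ramp the batch size (in tokens) from `start` to `end`
    over the first `ramp_steps` optimizer steps, then hold at `end`."""
    if step >= ramp_steps:
        return end
    frac = step / ramp_steps
    return int(start + frac * (end - start))

# Step 0 uses a 32K-token batch; by `ramp_steps` the batch reaches 3.2M tokens.
```

Ramping the batch up gradually is a common trick for stabilizing early training of very large models while still reaching high throughput later.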
The paper evaluated three in-context learning settings:
- Zero-shot: The model receives only a natural language instruction describing the task.
- One-shot: The model receives one demonstration example alongside the instruction.
- Few-shot: The model receives a small number of demonstration examples (typically 10–100), limited by the context window of 2048 tokens.
In all settings, the model generates answers autoregressively without any weight updates. Task performance is measured by comparing model outputs against expected answers.
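The three settings differ only in how the prompt is assembled. A minimal sketch, using the `source => target` formatting of the paper's translation illustrations (the helper name is mine):

```python
def build_prompt(instruction: str,
                 examples: list[tuple[str, str]],
                 query: str) -> str:
    """Assemble a zero-, one-, or few-shot prompt: a task description,
    k completed demonstrations, and a final query left for the model
    to complete. k=0 gives zero-shot, k=1 one-shot, k>1 few-shot."""
    lines = [instruction]
    lines += [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)

# Few-shot (k=2) English-to-French translation prompt:
prompt = build_prompt(
    "Translate English to French:",
    [("cheese", "fromage"), ("house", "maison")],
    "sea otter",
)
```

The model then continues the text after the final `=>`; no parameters change between tasks, only the string fed in.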
Results
GPT-3 achieved strong few-shot results across a wide range of NLP tasks:
- Translation: Few-shot GPT-3 outperformed prior unsupervised methods on several language pairs, though it remained below state-of-the-art supervised systems.
- Question answering: On TriviaQA, few-shot GPT-3 achieved 71.2% accuracy, competitive with fine-tuned models that access external retrieval systems.
- Cloze and completion tasks: On LAMBADA, few-shot GPT-3 achieved 86.4% accuracy, surpassing the state-of-the-art by over 18 points.
- SuperGLUE: Few-shot GPT-3 approached fine-tuned BERT-Large performance on several tasks, though it underperformed on some where bidirectional context is critical.
Performance consistently improved with model scale. The gap between zero-shot and few-shot performance also widened with scale, suggesting that larger models are better at leveraging in-context examples. The paper trained eight model sizes from 125M to 175B parameters to establish these scaling trends.
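The smooth improvement with scale can be illustrated with the power-law form from Kaplan et al. (cited in the references). The constants below are theirs for language-modeling loss; applying them to GPT-3's smallest and largest variants is only a rough sketch, not a result from this paper:

```python
# Kaplan et al. (2020) fit test loss against non-embedding parameter
# count N as L(N) = (N_c / N) ** alpha, with alpha ~= 0.076 and
# N_c ~= 8.8e13 for models trained to convergence.

def power_law_loss(n_params: float,
                   n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted language-modeling loss as a function of parameter count."""
    return (n_c / n_params) ** alpha

small, large = 125e6, 175e9  # smallest and largest GPT-3 variants
ratio = power_law_loss(small) / power_law_loss(large)
print(f"predicted loss ratio across the model family: {ratio:.2f}")  # ~1.73
```

Because the exponent is small, each order of magnitude of parameters buys a modest but reliable loss reduction, which is why trends measured on the smaller variants extrapolated cleanly to 175B.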
GPT-3 also demonstrated abilities in arithmetic, word unscrambling, and novel word usage, suggesting the emergence of more general reasoning capabilities at sufficient scale.
Impact
GPT-3 marked a turning point in AI research and commercialization. It demonstrated that scale could serve as a substitute for task-specific supervision, catalyzing the development of even larger language models and the "foundation model" paradigm. The paper directly led to the creation of the OpenAI API, one of the first widely available large language model services, which spawned an ecosystem of applications built on in-context learning and prompt engineering.
The paper's analysis of societal impacts — including bias amplification, potential for misuse in generating disinformation, and environmental costs of training — helped establish responsible AI disclosure as a norm in large model publications. The scaling laws it demonstrated influenced subsequent research directions, including the Chinchilla scaling analysis and efforts toward more compute-efficient training.
The concept of in-context learning introduced by GPT-3 fundamentally changed how practitioners interact with language models. Rather than training specialized models for each task, users could now write natural language prompts to elicit desired behavior — a practice that evolved into the field of prompt engineering. This shift lowered the barrier to AI application development and enabled non-experts to leverage large language models for a wide variety of tasks.
GPT-3's training cost, estimated at several million dollars, also sparked important discussions about the concentration of AI capabilities in well-funded organizations and the environmental footprint of large-scale model training.
The paper's comprehensive evaluation across dozens of benchmarks set a new standard for how large language models are assessed, moving beyond single-task leaderboards toward broad capability suites that better characterize a model's general abilities.
See also
- Attention Is All You Need
- BERT Pre-training of Deep Bidirectional Transformers
- Efficient Estimation of Word Representations
References
- Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). arXiv:2005.14165
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
- Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.