Language Models are Few-Shot Learners

Research Paper
Authors	Tom B. Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M. Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei
Year	2020
Topic area	NLP
Difficulty	Research
arXiv	2005.14165

Affiliations: OpenAI; Johns Hopkins University and OpenAI. Some authors contributed equally.

Author contributions are listed at the end of the paper. (2020)

摘要

近期研究表明,在大规模文本语料上进行预训练、再针对特定任务进行微调,可以在众多自然语言处理任务和基准上取得可观的提升。尽管这一方法在架构上通常是任务无关的,但仍需要每个任务配备数千乃至数万样本规模的微调数据集。相比之下,人类通常仅凭几个示例或简单指令即可完成新的语言任务——而当前的自然语言处理系统在这方面仍举步维艰。本文表明,扩大语言模型的规模可以大幅提升任务无关的少样本性能,有时甚至可与此前的最先进微调方法相媲美。具体而言,我们训练了 GPT-3,一种具有 1750 亿参数的自回归语言模型,参数量是此前任何非稀疏语言模型的 10 倍,并在少样本设定下测试其性能。对于所有任务,GPT-3 都不进行任何梯度更新或微调,任务及其少样本演示完全通过与模型的文本交互来指定。GPT-3 在众多自然语言处理数据集上表现出色,包括翻译、问答和填空(cloze)任务,以及若干需要即时推理或领域适应的任务,例如打乱字母后还原单词、在句子中使用新词或进行三位数算术。与此同时,我们也指出了一些 GPT-3 的少样本学习仍存在困难的数据集,以及一些因在大规模网络语料上训练而面临方法论问题的数据集。最后,我们发现 GPT-3 能够生成令人类评估者难以与人写文章区分开的新闻文章样本。我们讨论了这一发现以及 GPT-3 整体所带来的更广泛的社会影响。

1 引言
2 方法
3 结果
4 基准记忆的测量与防范
5 局限
6 更广泛的影响
7 相关工作
8 结论
A Common Crawl 过滤细节
B 模型训练细节
C 测试集污染研究细节
D 训练语言模型所用的总算力
E 合成新闻文章的人工质量评估
F GPT-3 的更多样例
G 任务表述与规范的细节
H 所有模型规模在所有任务上的结果

1 引言

近年来,自然语言处理系统中预训练语言表示的应用呈现出明显的趋势,且在下游迁移中以越来越灵活和任务无关的方式被使用。最初,人们通过词向量学习单层表示 [82, 102] 并将其输入到任务特定的架构中;随后,使用具有多层表示和上下文状态的 RNN 来形成更强的表示 [24, 81, 100](但仍应用于任务特定的架构);最近,预训练的循环或 transformer 语言模型 [134] 已被直接微调,完全消除了对任务特定架构的需求 [112, 20, 43]。

最后这一范式使得许多具有挑战性的自然语言处理任务取得了实质性进展,例如阅读理解、问答、文本蕴含等等,并基于新的架构和算法持续推进 [116, 74, 139, 62]。然而,该方法的一个主要限制在于:尽管架构是任务无关的,但仍需要任务特定的数据集和任务特定的微调——要在某个目标任务上取得强劲性能,通常需要在该任务专属的数千乃至数十万个样本上进行微调。出于多方面的原因,消除这一限制将是可取的。

首先,从实用角度看,每一项新任务都需要大规模的有标注样本数据集,这限制了语言模型的适用性。可能有用的语言任务范围极广,从更正语法、为某一抽象概念生成示例,到对一篇短篇小说进行评论。对于其中许多任务,要收集到大型的监督训练集是困难的,尤其当这一过程必须为每个新任务重复执行时。

其次,利用训练数据中虚假相关性的潜力,从根本上说会随着模型表达能力的增强和训练分布的狭窄程度而增加。这可能为预训练加微调范式带来问题:为了在预训练阶段吸收信息,模型被设计得很大,但随后又在非常狭窄的任务分布上进行微调。例如,[41] 观察到更大的模型不一定具有更好的分布外泛化能力。有证据表明,该范式下取得的泛化效果可能很差,因为模型过度迎合了训练分布,而难以在该分布之外良好泛化 [138, 88]。因此,微调后模型在特定基准上的表现,即便名义上达到了人类水平,也可能夸大了其在底层任务上的实际表现 [36, 91]。

第三,人类无需大型监督数据集就能学会绝大多数语言任务——只要用自然语言给出简短指令(例如"请告诉我这句话描述的是开心的事还是悲伤的事"),或至多极少量的演示(例如"以下是两个人表现勇敢的例子,请再给出第三个勇敢的例子"),通常就足以让人类以合理的胜任程度完成新任务。除了揭示当前自然语言处理技术存在的概念性局限之外,这种适应性还具有实用优势——它使人类能够无缝地混合或在多种任务与技能之间切换,例如在一段长对话中临时进行加法运算。为了具备广泛的实用价值,我们终有一天希望我们的自然语言处理系统也具有这种流畅性与通用性。

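One potential route toward addressing these issues is meta-learning¹, which in the context of language models means that the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [117] attempts to do this via what we call "in-context learning", using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task, and is then expected to complete further instances of the task simply by predicting what comes next.

¹ In the context of language models this has sometimes been called "zero-shot transfer", but this term is potentially ambiguous: the method is "zero-shot" in the sense that no gradient updates are performed, but it often involves providing inference-time demonstrations to the model, so it is not truly learning from zero examples. To avoid this confusion, we use the term "meta-learning" to capture the inner-loop / outer-loop structure of the general method, and the term "in-context learning" to refer to the inner loop of meta-learning. We further specialize the description to "zero-shot", "one-shot", or "few-shot" depending on how many demonstrations are provided at inference time. These terms are intended to remain agnostic on the question of whether the model learns new tasks from scratch at inference time or simply recognizes patterns seen during training. This is an important issue that we discuss later in the paper, but "meta-learning" is intended to encompass both possibilities and simply describes the inner-outer loop structure.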

尽管该方法展现了一些初步的潜力,但其结果仍远逊于微调——例如,[117] 在 Natural Questions 上仅取得 4%,即便其在 CoQa 上 55 F1 的结果如今也比最先进水平落后了 35 分以上。元学习显然需要大幅改进,才能成为解决语言任务的可行实用方法。

语言建模领域的另一条近期趋势可能提供了一条出路。近年来,transformer 语言模型的容量大幅提升:从 1 亿参数 [112] 到 3 亿参数 [20]、再到 15 亿参数 [117]、80 亿参数 [125]、110 亿参数 [116],最终达到 170 亿参数 [132]。每一次扩容都带来了文本合成和/或下游自然语言处理任务上的改进,且有证据表明对数损失(其与许多下游任务高度相关)随规模呈现平滑的改进趋势 [57]。鉴于上下文学习涉及在模型参数中吸收许多技能与任务,上下文学习能力随规模呈现出同样强劲的提升,这是合理的预期。

在本文中,我们通过训练一个具有 1750 亿参数的自回归语言模型(我们称之为 GPT-3)来检验这一假设,并衡量其上下文学习能力。具体而言,我们在两打以上的自然语言处理数据集上对 GPT-3 进行评估,并设计了若干新颖任务,用以测试其对训练集中可能并不直接包含的任务的快速适应能力。对于每个任务,我们在 3 种条件下评估 GPT-3:(a)"少样本学习",即上下文学习,允许提供尽可能多的、能装入模型上下文窗口的演示(通常为 10 到 100 个);(b)"一样本学习",仅允许提供一个演示;(c)"零样本"学习,不允许任何演示,仅向模型提供自然语言指令。原则上,GPT-3 也可以在传统的微调设定下进行评估,但我们将其留给未来工作。

图 1.2 展示了我们研究的各种条件,并演示了一项要求模型从单词中去除多余符号的简单任务的少样本学习。当加入自然语言任务描述,以及随着模型上下文中示例数 ${\textstyle K}$ 的增加,模型性能均有所提升。少样本学习的表现也随模型规模显著提升。尽管本例中的结果尤为突出,但模型规模与上下文中示例数所对应的总体趋势,在我们研究的大多数任务上都成立。我们强调,这些"学习"曲线不涉及任何梯度更新或微调,仅仅是作为条件输入向模型提供越来越多的演示。

总体而言,在自然语言处理任务上,GPT-3 在零样本和一样本设定下取得了令人鼓舞的结果,而在少样本设定下,有时可与最先进水平媲美,甚至偶尔超越最先进水平(尽管最先进水平由微调后的模型保持)。例如,GPT-3 在零样本设定下在 CoQA 上达到 81.5 F1,一样本下 84.0 F1,少样本下 85.0 F1。类似地,GPT-3 在 TriviaQA 上零样本下取得 64.3% 准确率,一样本下 68.0%,少样本下 71.2%,其中最后一项相对于在同样闭卷设定下的微调模型而言是最先进水平。

GPT-3 在一些旨在测试快速适应或即时推理的任务上也展现出一样本和少样本能力,这些任务包括打乱字母还原单词、进行算术运算,以及在仅一次性看到定义后将新词用于句中。我们还展示了在少样本设定下,GPT-3 可以生成令人类评估者难以与人写文章区分的合成新闻文章。

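At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. These include natural language inference tasks such as the ANLI dataset, and reading comprehension datasets such as RACE or QuAC. By presenting a broad characterization of GPT-3's strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.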

总体结果的直观印象可见图 1.3,该图汇总了各类任务(尽管该图本身不应被视为一个严格或有实际意义的基准)。

我们还系统地研究了"数据污染"——这是在 Common Crawl 等数据集上训练高容量模型时日益严重的问题,因为这类数据集可能包含来自测试数据集的内容,仅仅因为此类内容常常出现在网络上。在本文中,我们开发了系统性工具来度量数据污染并量化其失真效应。虽然我们发现数据污染对 GPT-3 在大多数数据集上的性能影响极小,但我们的确发现了少数几个可能因此夸大结果的数据集;视严重程度而定,我们要么不报告这些数据集上的结果,要么用星号予以标注。

除了上述之外,我们还训练了一系列规模更小的模型(参数量从 1.25 亿到 130 亿不等),以便在零样本、一样本和少样本设定下与 GPT-3 进行性能比较。总体而言,在三种设定下,大多数任务都呈现出相对平滑的随容量扩展趋势;一个值得注意的现象是,零样本、一样本与少样本表现之间的差距通常会随模型容量增大而扩大,这或许表明更大的模型是更出色的元学习者。

最后,鉴于 GPT-3 展现出的广泛能力,我们讨论了有关偏见、公平性以及更广泛社会影响的担忧,并尝试对 GPT-3 在这些方面的特征进行初步分析。

本文其余部分的结构如下。第 2 节描述了我们用于训练 GPT-3 和评估它的方法。第 3 节给出了在零样本、一样本和少样本设定下、覆盖全部任务的结果。第 4.1 节讨论了数据污染(训练-测试重叠)问题。第 5 节讨论了 GPT-3 的局限性。第 6 节讨论了更广泛的影响。第 7 节回顾了相关工作,第 8 节进行了总结。

2 方法

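Our basic pre-training approach, including model, data, and training, is similar to the process described in [117], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [117], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings on which we evaluate GPT-3, or could in principle evaluate it. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):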

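• Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [88], and the potential to exploit spurious features of the training data [36, 91], which may result in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle, and this is a promising direction for future work.

• Few-Shot (FS) is the term we use in this work for the setting where the model is given a few demonstrations of the task at inference time as conditioning [117], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and its French translation), and few-shot works by giving ${\textstyle K}$ examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set ${\textstyle K}$ in the range of 10 to 100, as this is how many examples can fit in the model's context window ( ${\textstyle n_{ctx} = 2048}$ ). The main advantages of few-shot are a major reduction in the need for task-specific data and a reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. A small amount of task-specific data is also still required. As the name indicates, few-shot learning as described here for language models is related to few-shot learning as used in other areas of machine learning [45, 133]: both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.

• One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast, it is sometimes difficult to communicate the content or format of a task if no examples are given.

• Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of pre-training data), but it is also the most challenging setting. In some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases "unfairly hard". For example, if someone is asked to "make a table of world records for the 200m dash", this request can be ambiguous, as it may not be clear exactly what format the table should have or what should be included (and even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at least some settings zero-shot is closest to how humans perform tasks: in the translation example in Figure 2.1, a human would likely know what to do from the text instruction alone.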
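The zero-shot, one-shot, and few-shot settings differ only in how many demonstrations are placed in the prompt. The following is a minimal sketch of that idea, not the authors' code: the function name `build_prompt`, the `=>` separator, and the English-French word pairs are illustrative assumptions.

```python
def build_prompt(task_description, demonstrations, query, sep="\n"):
    """Assemble a zero-, one-, or few-shot prompt as plain text.

    demonstrations is a list of (context, completion) pairs; its length is K.
    The "=>" separator is an assumption chosen for illustration.
    """
    lines = []
    if task_description:
        lines.append(task_description)
    for context, completion in demonstrations:
        lines.append(f"{context} => {completion}")
    # The final example carries a context only; the model supplies the completion.
    lines.append(f"{query} =>")
    return sep.join(lines)


# Illustrative (hypothetical) demonstrations for English-to-French translation.
demos = [("sea otter", "loutre de mer"),
         ("peppermint", "menthe poivrée"),
         ("cheese", "fromage")]

zero_shot = build_prompt("Translate English to French:", [], "plush giraffe")
one_shot = build_prompt("Translate English to French:", demos[:1], "plush giraffe")
few_shot = build_prompt("Translate English to French:", demos, "plush giraffe")
```

No weights change in any of the three cases; only the conditioning text grows as K goes from 0 to 1 to several.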

图 2.1 通过英语到法语翻译的例子展示了这四种方法。在本文中,我们聚焦于零样本、一样本和少样本,目的是不将它们视为相互竞争的替代方案进行比较,而是视为不同的问题设定,它们在特定基准上的性能与样本效率之间提供了不同的权衡。我们尤其强调少样本结果,因为其中许多仅略落后于最先进的微调模型。但归根结底,一样本,有时甚至零样本,似乎才是与人类性能最公正的比较,也是未来工作的重要目标。

下文第 2.1 节至 2.3 节分别详述了我们的模型、训练数据和训练过程。第 2.4 节讨论了我们如何进行少样本、一样本和零样本评估的细节。

2.1 模型与架构

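We use the same model and architecture as GPT-2 [117], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [15]. To study the dependence of machine learning performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [57] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.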

模型名称	${\textstyle n_{params}}$	${\textstyle n_{layers}}$	${\textstyle d_{model}}$	${\textstyle n_{heads}}$	${\textstyle d_{head}}$	批大小	学习率
GPT-3 Small	125M	12	768	12	64	0.5M	${\textstyle 6.0 \times 10^{- 4}}$
GPT-3 Medium	350M	24	1024	16	64	0.5M	${\textstyle 3.0 \times 10^{- 4}}$
GPT-3 Large	760M	24	1536	16	96	0.5M	${\textstyle 2.5 \times 10^{- 4}}$
GPT-3 XL	1.3B	24	2048	24	128	1M	${\textstyle 2.0 \times 10^{- 4}}$
GPT-3 2.7B	2.7B	32	2560	32	80	1M	${\textstyle 1.6 \times 10^{- 4}}$
GPT-3 6.7B	6.7B	32	4096	32	128	2M	${\textstyle 1.2 \times 10^{- 4}}$
GPT-3 13B	13.0B	40	5140	40	128	2M	${\textstyle 1.0 \times 10^{- 4}}$
GPT-3 175B or “GPT-3”	175.0B	96	12288	96	128	3.2M	${\textstyle 0.6 \times 10^{- 4}}$

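Table 2.1 shows the sizes and architectures of our 8 models. Here ${\textstyle n_{params}}$ is the total number of trainable parameters, ${\textstyle n_{layers}}$ is the total number of layers, ${\textstyle d_{model}}$ is the number of units in each bottleneck layer (we always make the feed-forward layer four times the size of the bottleneck layer, ${\textstyle d_{ff} = {4 \ast d_{model}}}$ ), and ${\textstyle d_{head}}$ is the dimension of each attention head. All models use a context window of ${\textstyle n_{ctx} = 2048}$ tokens. We partition the model across GPUs along both the depth and width dimensions in order to minimize data transfer between nodes. The precise architectural parameters for each model are chosen based on computational efficiency and load balancing in the layout of models across GPUs. Previous work [57] suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.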
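As a rough sanity check on Table 2.1, the parameter count of a transformer stack can be estimated from the layer count and width alone. The sketch below uses the common approximation of 12 · n_layers · d_model² (four d_model × d_model attention projections plus a feed-forward block with d_ff = 4 · d_model); it ignores embeddings, biases, and layer-norm parameters, so it is an assumption-laden estimate and not the accounting used by the authors.

```python
def approx_params(n_layers, d_model):
    """Approximate non-embedding parameter count of a transformer stack.

    Per layer: 4 * d_model^2 for the attention projections (Q, K, V, output)
    plus 8 * d_model^2 for a feed-forward block with d_ff = 4 * d_model.
    Embeddings, biases, and layer norms are ignored.
    """
    return 12 * n_layers * d_model ** 2


# (n_layers, d_model) pairs taken from Table 2.1.
configs = {
    "GPT-3 Small": (12, 768),
    "GPT-3 13B": (40, 5140),
    "GPT-3 175B": (96, 12288),
}
estimates = {name: approx_params(*cfg) for name, cfg in configs.items()}

for name, value in estimates.items():
    print(f"{name}: ~{value / 1e9:.2f}B non-embedding parameters")
```

For the largest configuration the estimate lands within about one percent of the stated 175B; for the smallest models the omitted embedding matrix accounts for a much larger share of the total, so the estimate falls visibly short of the table value.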

2.2 训练数据集

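Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset (https://commoncrawl.org/the-data/) [116], which constitutes nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora; (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting; and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.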

前两点(Common Crawl 的处理)的细节详见附录 A。关于第三点,我们加入了若干精心整理的高质量数据集,包括 WebText 数据集 [117] 的扩展版本(通过在更长时间段内抓取链接收集,首次描述见 [57])、两个基于互联网的图书语料(Books1 和 Books2),以及英文维基百科。

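Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size; instead, datasets we view as higher-quality are sampled more frequently, such that the CommonCrawl and Books2 datasets are sampled less than once during training, while the other datasets are sampled 2 to 3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.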

Dataset	Quantity (tokens)	Weight in training mix	Epochs elapsed when training for 300B tokens
Common Crawl (filtered)	410 billion	60%	0.44
WebText2	19 billion	22%	2.9
Books1	12 billion	8%	1.9
Books2	55 billion	8%	0.43
Wikipedia	3 billion	3%	3.4
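The last column of Table 2.2 follows from the other two: a dataset with mixture weight w contributes roughly w × 300B tokens to training, and dividing by the dataset size gives the number of passes over it. The sketch below illustrates that relationship under the assumption that the weights are exact; because the published weights are rounded, the computed values match the table only approximately for some rows.

```python
TOTAL_TRAINING_TOKENS = 300e9  # training budget stated in the table header

# (dataset size in tokens, weight in training mix), copied from Table 2.2.
mixture = {
    "Common Crawl (filtered)": (410e9, 0.60),
    "WebText2": (19e9, 0.22),
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
    "Wikipedia": (3e9, 0.03),
}


def epochs_elapsed(tokens, weight, total=TOTAL_TRAINING_TOKENS):
    """Number of passes over a dataset when it supplies `weight` of `total` tokens."""
    return weight * total / tokens


for name, (tokens, weight) in mixture.items():
    print(f"{name}: {epochs_elapsed(tokens, weight):.2f} epochs")
```

Values below 1 (Common Crawl, Books2) mean the dataset is never seen in full, while values above 1 mean it is repeated, which is the quality-for-overfitting trade described in the text.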

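A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4.1 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination.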

2.3 训练过程

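As found in [57, 85], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [85]. Table 2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. Details of the training process and hyperparameter settings are described in Appendix B.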

2.4 评估

对于少样本学习,我们对评估集中的每个样本进行评估时,从该任务的训练集中随机抽取 ${\textstyle K}$ 个样本作为条件,条件样本之间根据任务用 1 个或 2 个换行符分隔。对于 LAMBADA 和 Storycloze 没有可用的监督训练集,因此我们从开发集中抽取条件样本,并在测试集上评估。对于 Winograd(原始版本而非 SuperGLUE 版本),只存在一个数据集,因此我们直接从中抽取条件样本。

${\textstyle K}$ 可以取从 0 到模型上下文窗口允许的最大值之间的任意值,所有模型的上下文窗口为 ${\textstyle n_{ctx} = 2048}$ ,通常可容纳 ${\textstyle 10}$ 至 ${\textstyle 100}$ 个示例。较大的 ${\textstyle K}$ 通常但并非总是更好,因此当存在分别的开发集和测试集时,我们会在开发集上尝试若干 ${\textstyle K}$ 值,然后在测试集上运行最佳的那个。对于某些任务(参见附录 G),除了演示之外,我们还使用自然语言提示(在 ${\textstyle K = 0}$ 时则代替演示)。
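The sampling procedure described above (draw K demonstrations at random, join them with newlines, and stay within the context window) can be sketched as follows. This is an illustration under stated assumptions: `count_tokens` uses whitespace splitting as a crude stand-in for the BPE tokenizer, and the function and parameter names are hypothetical.

```python
import random

N_CTX = 2048  # context window size stated in the text


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for BPE tokens.
    return len(text.split())


def sample_context(train_set, k, query, delimiter="\n\n", n_ctx=N_CTX, seed=0):
    """Draw up to k demonstrations at random and prepend them to the query.

    Demonstrations are dropped from the end until the prompt fits in n_ctx
    tokens. Returns the prompt and the number of demonstrations actually used.
    """
    rng = random.Random(seed)
    demos = rng.sample(train_set, min(k, len(train_set)))
    while demos and count_tokens(delimiter.join(demos + [query])) > n_ctx:
        demos.pop()
    return delimiter.join(demos + [query]), len(demos)


# Toy training set: each demonstration is 6 whitespace tokens long.
train_set = [f"Q: item {i} A: answer {i}" for i in range(50)]
query = "Q: item x A:"

prompt, k_used = sample_context(train_set, k=10, query=query)
tight_prompt, tight_k = sample_context(train_set, k=10, query=query, n_ctx=20)
```

With the default window all 10 demonstrations fit; with the artificially small 20-token window only 2 remain, which mirrors how the usable K depends on example length.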

在涉及从若干选项中选出一个正确续写(多项选择)的任务上,我们提供 ${\textstyle K}$ 个含上下文加正确续写的示例,后接一个仅含上下文的示例,然后比较语言模型对每个候选续写的似然。对于大多数任务,我们比较每个 token 的似然(以归一化长度);但在少量数据集(ARC、OpenBookQA 和 RACE)上,我们通过用每个续写的无条件概率进行归一化,在开发集上获得额外收益,计算方式为 ${\textstyle \frac{P\hspace{0pt}{(\left. {completion} \middle| {context} \right.)}}{P\hspace{0pt}{(\left. {completion} \middle| {{answer}\hspace{0pt}\_\hspace{0pt}{context}} \right.)}}}$ ,其中 ${\textstyle {answer}\hspace{0pt}\_\hspace{0pt}{context}}$ 是字符串 "Answer: " 或 "A: ",用于提示该续写应当是一个答案,但在其他方面较为通用。
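The two scoring rules for multiple-choice tasks can be written compactly in log space: the length-normalized rule averages the per-token log-probabilities of a completion, and the unconditional rule subtracts the log-probability of the same completion given only the generic answer context. The sketch below assumes a hypothetical `token_logprobs(prefix, completion)` interface that returns one log-probability per completion token; the toy scorer exists only to make the example runnable and is not a language model.

```python
import math


def pick_answer(context, candidates, token_logprobs, answer_context=None):
    """Choose the candidate completion with the highest score.

    If answer_context is None, score = mean per-token log-likelihood.
    Otherwise score = log P(completion | context) - log P(completion | answer_context).
    """
    scores = []
    for completion in candidates:
        cond = token_logprobs(context, completion)
        if answer_context is None:
            scores.append(sum(cond) / len(cond))
        else:
            uncond = token_logprobs(answer_context, completion)
            scores.append(sum(cond) - sum(uncond))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


def toy_logprobs(prefix, completion):
    # Toy stand-in: words already present in the prefix are treated as likely.
    seen = set(prefix.lower().split())
    return [math.log(0.5) if w.lower() in seen else math.log(0.1)
            for w in completion.split()]


context = "the cat sat on the mat and the cat stayed on the"
choice_norm, _ = pick_answer(context, ["mat", "roof"], toy_logprobs)
choice_uncond, _ = pick_answer(context, ["mat", "roof"], toy_logprobs,
                               answer_context="Answer:")
```

The unconditional normalization discounts completions that are probable regardless of the question, which is one plausible reason it helped on the datasets named above.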

对于涉及二分类的任务,我们为选项赋予语义上更有意义的名称(例如使用"True"或"False"而非 0 或 1),然后将其视为多项选择来处理;有时我们也按 [116] 的做法来构造任务(详见附录 G)。

对于自由形式续写的任务,我们使用与 [116] 相同参数的束搜索:束宽为 4,长度惩罚 ${\textstyle \alpha = 0.6}$ 。我们根据所讨论数据集的常用做法,使用 F1 相似度分数、BLEU 或精确匹配来对模型评分。

当测试集公开可用时,我们针对每种模型规模与学习设定(零样本、一样本、少样本)在测试集上报告最终结果。当测试集为私有时,我们的模型通常太大,无法装入测试服务器,因此我们报告开发集上的结果。在少数我们能够成功提交的数据集上(SuperGLUE、TriviaQA、PiQa),我们确实向测试服务器进行了提交,且仅提交 200B 模型的少样本结果,其他情况下均报告开发集上的结果。

3 结果

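Figure 3.1 shows training curves for the 8 models described in Section 2. For this graph we also include 6 additional extra-small models with as few as 100,000 parameters. As observed in [57], language modeling performance follows a power law when making efficient use of training compute. After extending this trend by two more orders of magnitude, we observe only a slight departure (if any) from the power law. One might worry that these improvements in cross-entropy loss come only from modeling spurious details of the training corpus. However, we will see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a broad spectrum of natural language tasks.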

下文我们在多种数据集上评估第 2 节中描述的 8 个模型(1750 亿参数的 GPT-3 以及 7 个较小模型)。我们将数据集分为 9 类,每一类代表大致相似的任务。

在第 3.1 节中,我们在传统语言建模任务以及与语言建模类似的任务上进行评估,例如填空(Cloze)任务和句子/段落续写任务。在第 3.2 节中,我们在"闭卷"问答任务上进行评估:这类任务需要利用存储在模型参数中的信息来回答常识性问题。在第 3.3 节中,我们评估模型在不同语言之间进行翻译的能力(尤其是一样本和少样本)。在第 3.4 节中,我们评估模型在 Winograd Schema 类任务上的表现。在第 3.5 节中,我们在涉及常识推理或问答的数据集上进行评估。在第 3.6 节中我们在阅读理解任务上进行评估,在第 3.7 节中我们在 SuperGLUE 基准套件上进行评估,在 3.8 中我们简要探索 NLI。最后,在第 3.9 节中,我们设计了一些额外任务,专门用于探查上下文学习能力——这些任务聚焦于即时推理、适应能力或开放式文本合成。所有任务我们均在少样本、一样本和零样本设定下进行评估。

3.1 语言建模、完形填空与补全任务

在本节中,我们测试 GPT-3 在传统语言建模任务上的表现,以及在相关任务上的表现:这些任务包括预测某个目标词、续写一个句子或段落,或者在一段文本的可能续写之间作出选择。

3.1.1 语言建模

我们在 [117] 中所测量的 Penn Tree Bank (PTB) [86] 数据集上计算零样本困惑度。我们略去了该工作中与维基百科相关的 4 项任务,因为它们完全包含在我们的训练数据中;我们也略去了 one-billion word 基准,因为该数据集的很大一部分包含在我们的训练集中。PTB 由于早于现代互联网而避免了这些问题。我们最大的模型在 PTB 上以 15 分的显著优势创下了新的最先进水平,困惑度达到 20.50。请注意,由于 PTB 是传统的语言建模数据集,没有清晰的样本切分可用于定义一样本或少样本评估,因此我们只测量零样本。

设置	PTB
SOTA (Zero-Shot)	35.8^a
GPT-3 Zero-Shot	20.5

3.1.2 LAMBADA

设置	LAMBADA (acc)	LAMBADA (ppl)	StoryCloze (acc)	HellaSwag (acc)
SOTA	68.0^a	8.63^b	91.8^c	85.6^d
GPT-3 Zero-Shot	76.2	3.00	83.2	78.9
GPT-3 One-Shot	72.5	3.35	84.7	78.1
GPT-3 Few-Shot	86.4	1.92	87.7	79.3

LAMBADA 数据集 [99] 测试对文本中长距离依赖关系的建模——要求模型预测句子的最后一个词,而该预测需要阅读一段上下文。最近有人指出,在这一困难基准上,语言模型的持续扩展正在产生递减的收益。[9] 反思了在两个近期最先进结果之间(分别为 [125] 和 [132])模型规模翻倍仅带来 1.5% 的小幅改进,并提出"将硬件和数据规模继续扩大几个数量级并非前进之路"。我们发现该道路仍有前景,在零样本设定下 GPT-3 在 LAMBADA 上达到 76%,比此前最先进水平提高了 8%。

LAMBADA 也展示了少样本学习的灵活性,因为它提供了一种解决该数据集上一个经典问题的方法。尽管 LAMBADA 的续写始终是某个句子的最后一个词,但标准的语言模型并不知道这一细节。因此,它不仅会为正确的结尾词赋予概率,还会为该段落的其他合法延续赋予概率。这一问题过去已通过停用词过滤器部分地解决 [117](过滤"延续"类词汇)。而少样本设定则使我们能够将任务"框定"为填空测试,并通过示例使语言模型推断出所需续写恰好为一个词。我们使用如下填空格式:

Alice 是 Bob 的朋友。Alice 去拜访了她的朋友。 ${\textstyle \rightarrow}$ Bob

George 买了一些棒球装备:一个球、一只手套和一个。 ${\textstyle \rightarrow}$

当呈现以这种格式化的示例时,GPT-3 在少样本设定下取得 86.4% 的准确率,较此前最先进水平提升超过 18%。我们观察到少样本性能随模型规模显著提升。在该设定下,最小模型的性能下降近 20%,而对 GPT-3 来说则将准确率提升了 10%。最后,该填空方法在一样本设定下并不奏效——其表现总是比零样本更差。这或许是因为所有模型仍需要若干示例才能识别该模式。

需要谨慎指出的是,测试集污染分析发现 LAMBADA 数据集中有相当一部分似乎出现在我们的训练数据中;不过,第 4.1 节进行的分析表明对性能的影响可以忽略不计。

3.1.3 HellaSwag

HellaSwag 数据集 [140] 涉及为一段故事或一组指令挑选最佳结尾。其样本是经对抗性挖掘得到的,旨在对语言模型困难,但对人类来说仍然简单(人类准确率为 95.6%)。GPT-3 在一样本设定下达到 78.1% 的准确率,在少样本设定下达到 79.3%,超过了 15 亿参数的微调语言模型 [141] 所取得的 75.4% 准确率,但仍明显低于微调多任务模型 ALUM 取得的 85.6% 总体最先进水平。

3.1.4 StoryCloze

接下来,我们在 StoryCloze 2016 数据集 [83] 上评估 GPT-3,该数据集涉及为五句话长度的故事选出正确的结尾句。在零样本设定下 GPT-3 达到 83.2%,在少样本设定下( ${\textstyle K = 70}$ )达到 87.7%。这仍比基于 BERT 模型的微调最先进水平 [64] 低 4.1%,但比此前的零样本结果提升了大约 10%。

3.2 闭卷问答

设置	NaturalQS	WebQS	TriviaQA
RAG (Fine-tuned, Open-Domain) [75]	44.5	45.5	68.0
T5-11B+SSM (Fine-tuned, Closed-Book) [115]	36.6	44.7	60.5
T5-11B (Fine-tuned, Closed-Book)	34.5	37.4	50.1
GPT-3 Zero-Shot	14.6	14.4	64.3
GPT-3 One-Shot	23.0	25.3	68.0
GPT-3 Few-Shot	29.9	41.5	71.2

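In this section we measure GPT-3's ability to answer questions about broad factual knowledge. Because of the immense number of possible queries, this task has normally been approached by using an information retrieval system to find relevant text, in combination with a model that learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text that potentially contains the answer, it is denoted "open-book". [115] recently demonstrated that a large language model can perform surprisingly well when answering questions directly, without conditioning on auxiliary information. They call this more restrictive evaluation setting "closed-book". Their work suggests that even higher-capacity models could perform better, and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [115]: Natural Questions [58], WebQuestions [5], and TriviaQA [49], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represents an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.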

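The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by 14.2%, and also outperforms a version with Q&A-tailored span prediction during pre-training by 3.8%. The one-shot result improves by a further 3.7% and matches the state of the art for an open-domain QA system, which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B-parameter dense vector index of 21M documents [75]. GPT-3's few-shot result improves on this by a further 3.2%.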

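On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5% in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM, which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to few-shot (and its zero-shot and one-shot performance is poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this distribution, recovering strong performance in the few-shot setting.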

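On Natural Questions (NQs), GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in the few-shot setting, compared to 36.6% for fine-tuned T5 11B+SSM. As with WebQs, the large gain from zero-shot to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to TriviaQA and WebQs. In particular, the questions in NQs tend toward very fine-grained knowledge on Wikipedia specifically, which could be testing the limits of GPT-3's capacity and broad pretraining distribution.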

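Overall, on one of the three datasets GPT-3's one-shot result matches the open-domain fine-tuning state of the art. On the other two datasets it approaches the performance of the closed-book state of the art despite not using fine-tuning. On all 3 datasets, we find that performance scales very smoothly with model size (Figure 3.3 and Appendix H, Figure H.7), possibly reflecting the idea that model capacity translates directly to more "knowledge" absorbed in the parameters of the model.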

3.3 翻译

对于 GPT-2,因容量考虑,使用了过滤器从多语种文档集合中生成仅英文的数据集。即便如此,GPT-2 仍展现出一定的多语种能力,在仅训练于 10 兆字节剩余法语文本的情况下,在法英互译上也取得了非平凡的表现。由于从 GPT-2 到 GPT-3 我们将容量提升了两个数量级以上,我们也扩展了训练数据集的范围,以纳入对其他语言更多的代表性内容,尽管这仍是有待进一步改进的领域。如第 2.2 节所述,我们的大部分数据来自原始的 Common Crawl,仅经过基于质量的过滤。尽管 GPT-3 的训练数据仍主要为英文(按词数统计为 93%),它也包含 7% 的其他语种文本。这些语言记录在补充材料中。为了更好地理解翻译能力,我们的分析也扩展到包括另外两种常见研究的语言:德语和罗马尼亚语。

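Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets with back-translation [123] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together in a natural way, combining them at the word, sentence, and document level. GPT-3 also uses a single training objective that is neither customized nor designed for any task in particular. However, our one-shot and few-shot settings are not strictly comparable to prior unsupervised work, since they make use of a small number of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data.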

设置	En ${\textstyle \rightarrow}$ Fr	Fr ${\textstyle \rightarrow}$ En	En ${\textstyle \rightarrow}$ De	De ${\textstyle \rightarrow}$ En	En ${\textstyle \rightarrow}$ Ro	Ro ${\textstyle \rightarrow}$ En
SOTA (Supervised)	45.6^a	35.0 ^b	41.2^c	40.2^d	38.5^e	39.9^e
XLM [61]	33.4	33.3	26.4	34.3	33.3	31.8
MASS [127]	37.5	34.9	28.3	35.2	35.2	33.1
mBART [66]	-	-	29.8	34.0	35.0	30.5
GPT-3 Zero-Shot	25.2	21.2	24.6	27.2	14.1	19.9
GPT-3 One-Shot	28.3	33.7	26.2	30.4	20.6	38.6
GPT-3 Few-Shot	32.6	39.2	29.7	40.6	21.0	39.5

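Results are shown in Table 3.4. Zero-shot GPT-3, which only receives a natural language description of the task, still underperforms recent unsupervised neural machine translation (NMT) results. However, providing only a single example demonstration for each translation task improves performance by over 7 BLEU and nears competitive performance with prior work. GPT-3 in the full few-shot setting improves by a further 4 BLEU or so, resulting in average performance similar to prior unsupervised NMT work. GPT-3's performance is noticeably skewed by translation direction. For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English, but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier, over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2, which was developed for an almost entirely English training dataset. For both Fr-En and De-En, few-shot GPT-3 outperforms the best supervised result we could find, but due to our unfamiliarity with the literature and the appearance that these are uncompetitive benchmarks, we do not suspect those results represent the true state of the art. For Ro-En, few-shot GPT-3 performs within 0.5 BLEU of the overall state of the art, which is achieved by a combination of unsupervised pretraining, supervised fine-tuning on 608K labeled examples, and back-translation [70].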

最后,在所有语言对和三种设定(零样本、一样本和少样本)上,性能均随模型容量平滑提升。少样本结果的这一趋势如图 3.4 所示,三种设定下的扩展情况详见附录 H。

3.4 Winograd 类任务

设置	Winograd	Winogrande (XL)
Fine-tuned SOTA	90.1^a	84.6^b
GPT-3 Zero-Shot	88.3*	70.2
GPT-3 One-Shot	89.7*	73.2
GPT-3 Few-Shot	88.6*	77.7

Winograd 模式挑战赛 [65] 是自然语言处理中的一项经典任务,需要确定某个代词所指的是哪个词——该代词在语法上具有歧义,但对人类来说在语义上是明确的。最近,微调后的语言模型在原始 Winograd 数据集上已达到接近人类的表现,但更困难的版本,例如经对抗性挖掘得到的 Winogrande 数据集 [118],仍显著落后于人类表现。我们在 Winograd 和 Winogrande 上测试 GPT-3 的表现,一如既往地在零样本、一样本和少样本设定下进行。

在 Winograd 上,我们使用 [117] 中描述的相同"部分评估"方法,在原始的 273 个 Winograd 模式上评估 GPT-3。请注意,该设定与 SuperGLUE 基准中的 WSC 任务略有不同——后者以二分类形式呈现,且需要实体抽取才能转换为本节所描述的形式。在 Winograd 上,GPT-3 在零样本、一样本和少样本设定下分别取得 88.3%、89.7% 和 88.6%,并未呈现明显的上下文学习,但在所有设定下均取得了仅比最先进水平和估计的人类表现低几个百分点的强劲结果。我们注意到污染分析在训练数据中发现了一些 Winograd 模式,但这似乎对结果影响很小(参见第 4.1 节)。

在更困难的 Winogrande 数据集上,我们确实发现上下文学习带来了收益:GPT-3 在零样本设定下达到 70.2%,一样本下 73.2%,少样本下 77.7%。作为对比,微调后的 RoBERTA 模型达到 79%,最先进水平为 84.6%(由微调的高容量模型 T5 取得),而 [118] 报告的人类在该任务上的表现为 94.0%。

3.5 常识推理

设置	PIQA	ARC (Easy)	ARC (Challenge)	OpenBookQA
Fine-tuned SOTA	79.4	92.0[55]	78.5[55]	87.2[55]
GPT-3 Zero-Shot	80.5*	68.8	51.4	57.6
GPT-3 One-Shot	80.5*	71.2	53.2	58.8
GPT-3 Few-Shot	82.8*	70.1	51.5	65.4

接下来我们考察三个旨在捕捉物理或科学推理的数据集,这与句子续写、阅读理解或广义知识问答有所区别。第一个,PhysicalQA(PIQA)[11],就物理世界如何运作提出常识性问题,意在作为对世界的具身理解的探测。GPT-3 在零样本下取得 81.0% 准确率,一样本下 80.5%,少样本下 82.8%(后者在 PIQA 的测试服务器上测得)。这相比此前微调 RoBERTa 取得的 79.4% 准确率最先进水平表现良好。PIQA 随模型规模呈现相对浅薄的扩展,且仍比人类性能低 10% 以上,但 GPT-3 的少样本乃至零样本结果都超过了当前的最先进水平。我们的分析将 PIQA 标记为存在潜在的数据污染问题(尽管测试标签是隐藏的),因此我们保守地用星号标注该结果。详情见第 4.1 节。

ARC [14] 是从 3 至 9 年级科学考试中收集的多项选择题数据集。在数据集的"挑战(Challenge)"版本中(该版本经过过滤,只保留简单统计或信息检索方法无法正确回答的题目),GPT-3 在零样本设定下达到 51.4% 的准确率,一样本下 53.2%,少样本下 51.5%。这接近 UnifiedQA [55] 中微调 RoBERTa 基线(55.9%)的表现。在数据集的"简单(Easy)"版本(被前述任一基线方法正确回答的题目)上,GPT-3 取得 68.8%、71.2% 和 70.1%,略微超过 [55] 的微调 RoBERTa 基线。然而,这两组结果仍远逊于 UnifiedQA 取得的总体最先进水平,后者在挑战集上比 GPT-3 的少样本结果高 27%,在简单集上高 22%。

在 OpenBookQA [84] 上,GPT-3 从零样本到少样本设定下有显著改进,但仍比总体最先进水平低 20 个百分点以上。GPT-3 的少样本性能与排行榜上微调后的 BERT Large 基线相近。

总体而言,GPT-3 的上下文学习在常识推理任务上表现喜忧参半:在 PIQA 和 ARC 上,一样本与少样本设定下仅观察到很小且不一致的收益,但在 OpenBookQA 上观察到了显著的改进。在新的 PIQA 数据集上,GPT-3 在所有评估设定下均刷新了最先进水平。

3.6 阅读理解

设置	CoQA	DROP	QuAC	SQuADv2	RACE-h	RACE-m
Fine-tuned SOTA	90.7^a	89.1^b	74.4^c	93.0^d	90.0^e	93.1^e
GPT-3 Zero-Shot	81.5	23.6	41.5	59.5	45.5	58.4
GPT-3 One-Shot	84.0	34.3	43.3	65.4	45.9	57.4
GPT-3 Few-Shot	85.0	36.5	44.3	69.8	46.8	58.1

接下来我们在阅读理解任务上评估 GPT-3。我们使用一套 5 个数据集,涵盖抽象式、选择题和基于片段的答案格式,涉及对话和单题两种设定。我们观察到 GPT-3 在这些数据集上的表现差异很大,提示其在不同答案格式下的能力存在差别。总体而言,我们观察到 GPT-3 与每个相应数据集上的早期基线及使用上下文表示训练的早期结果大致相当。

GPT-3 在 CoQA [106](自由形式的对话型数据集)上表现最佳(与人类基线相差不到 3 分),在 QuAC [16] 上表现最差(比 ELMo 基线低 13 F1),后者需要建模师生互动中结构化的对话行为和答案片段选择。在 DROP [27](一个在阅读理解情境下测试离散推理和数感的数据集)上,少样本设定的 GPT-3 超过了原文中微调的 BERT 基线,但仍远低于人类表现以及将神经网络与符号系统相结合的最先进方法 [110]。在 SQuAD 2.0 [108] 上,GPT-3 展示了其少样本学习能力,相比零样本设定提升了近 10 F1(达到 69.8)。这使它略微超过了原文中最佳的微调结果。在 RACE [78](一个由初中和高中英语考试组成的多项选择数据集)上,GPT-3 表现相对较弱,仅与最早采用上下文表示的工作相竞争,仍比最先进水平落后 45%。

	SuperGLUE Average	BoolQ Accuracy	CB Accuracy	CB F1	COPA Accuracy	RTE Accuracy
Fine-tuned SOTA	89.0	91.0	96.9	93.9	94.8	92.5
Fine-tuned BERT-Large	69.0	77.4	83.6	75.7	70.6	71.7
GPT-3 Few-Shot	71.8	76.4	75.6	52.0	92.0	69.0

	WiC Accuracy	WSC Accuracy	MultiRC Accuracy	MultiRC F1a	ReCoRD Accuracy	ReCoRD F1
Fine-tuned SOTA	76.1	93.8	62.3	88.2	92.5	93.3
Fine-tuned BERT-Large	69.6	64.6	24.1	70.0	71.3	72.0
GPT-3 Few-Shot	49.4	80.1	30.5	75.4	90.2	91.1

3.7 SuperGLUE

为了更好地汇总自然语言处理任务上的结果,并以更系统的方式与 BERT 和 RoBERTa 等流行模型进行比较,我们还在一个标准化的数据集合,即 SuperGLUE 基准 [135] [135] [17] [25] [105] [54] [142] [21] [8] [34] [6] [96] [98] 上评估 GPT-3。GPT-3 在 SuperGLUE 数据集测试集上的表现见表 3.8。在少样本设定下,我们对所有任务都使用 32 个示例,从训练集中随机抽样。除 WSC 和 MultiRC 之外,对于其他所有任务,我们都为每个问题在上下文中重新抽取一组新的示例。对于 WSC 和 MultiRC,我们对所评估的所有问题都使用从训练集中随机抽取的同一组示例作为上下文。

我们观察到 GPT-3 在不同任务上的表现差异很大。在 COPA 和 ReCoRD 上,GPT-3 在一样本和少样本设定下取得接近最先进水平的成绩,其中 COPA 仅落后几分,在排行榜上排名第二,排名第一的是一个 110 亿参数的微调模型(T5)。在 WSC 上,性能仍然较强,在少样本设定下达到 80.1%(请注意,如第 3.4 节所述,GPT-3 在原始 Winograd 数据集上达到 88.6%)。在 BoolQ、MultiRC 和 RTE 上,性能尚可,大致与微调后的 BERT-Large 相当。在 CB 上,我们看到了一线生机:在少样本设定下达到 75.6%。

WiC 是一个明显的弱点,少样本性能为 49.4%(等同于随机猜测)。我们尝试了多种针对 WiC 的不同表述与构造方式(WiC 涉及判断同一个词在两个句子中是否以相同含义被使用),其中没有一种能够取得较强的表现。这暗示了一种现象,在下一节(讨论 ANLI 基准)将更为明显:GPT-3 在少样本或一样本设定下,似乎在某些涉及比较两个句子或片段的任务上表现较弱,例如:某个词在两句中是否以同样方式被使用(WiC)、某句是否是另一句的释义,或某句是否蕴含另一句。这也能解释 RTE 和 CB 上相对较低的成绩,它们同样采用了这种格式。尽管存在这些弱点,GPT-3 在八个任务中的四个仍超越微调后的 BERT-large,并在两个任务上接近由微调的 110 亿参数模型保持的最先进水平。

最后,我们注意到少样本 SuperGLUE 分数随模型规模和上下文示例数稳步提升,显示出从上下文学习中获得的收益不断增加(图 3.8)。我们将每项任务的 ${\textstyle K}$ 扩大到 32 个示例,超过该值后额外的示例就无法可靠地装入我们的上下文。对 ${\textstyle K}$ 取值进行扫描时,我们发现 GPT-3 每项任务总共所需示例少于 8 个,即可在总体 SuperGLUE 分数上超过微调后的 BERT-Large。

3.8 NLI

自然语言推理(NLI)[31] 关注理解两个句子之间关系的能力。实际中,该任务通常构造为二分类或三分类问题,模型对第二句相对于第一句是否在逻辑上成立、是否与第一句矛盾,或是否可能为真(中性)进行分类。SuperGLUE 包含一个 NLI 数据集 RTE,它评估该任务的二分类版本。在 RTE 上,只有最大版本的 GPT-3 在任何评估设定下的表现明显优于随机(56%),但在少样本设定下,GPT-3 与单任务微调的 BERT Large 表现相近。我们还在最近引入的对抗性自然语言推理(ANLI)数据集上进行评估 [94]。ANLI 是一个困难的数据集,采用三轮(R1、R2 和 R3)经对抗性挖掘得到的自然语言推理问题。与 RTE 类似,即使在少样本设定下,我们所有比 GPT-3 小的模型在 ANLI 上的表现几乎与随机相同( ${\textstyle \sim {33\%}}$ ),而 GPT-3 本身在第 3 轮上显示出生机。ANLI R3 的结果在图 3.9 中突出显示,所有轮次的完整结果见附录 H。RTE 和 ANLI 的这些结果表明,NLI 对语言模型而言仍是非常困难的任务,它们刚刚开始显示出进展的迹象。

3.9 合成与定性任务

探查 GPT-3 在少样本(或零样本和一样本)设定下能力范围的一种方式,是给它一些需要进行简单的即时计算推理、识别训练中不太可能出现的新颖模式,或快速适应不常见任务的任务。我们设计了若干任务来测试此类能力。首先,我们测试 GPT-3 进行算术运算的能力。其次,我们创建若干涉及对单词中字母进行重新排列或还原的任务,这些任务不太可能在训练中被精确见过。第三,我们以少样本方式测试 GPT-3 求解 SAT 风格类比题的能力。最后,我们在若干定性任务上测试 GPT-3,包括将新词用于句中、修改英文语法以及新闻文章生成。我们将公开这些合成数据集,希望能促进对语言模型测试时行为的进一步研究。

3.9.1 算术

为了测试 GPT-3 在没有任务特定训练的情况下进行简单算术运算的能力,我们设计了一组 10 项小测试,以自然语言向 GPT-3 提出一个简单的算术问题:

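• 2 digit addition (2D+): the model is asked to add two integers sampled uniformly from ${\textstyle \lbrack 0,100)}$, phrased in the form of a question, e.g. "Q: What is 48 plus 76? A: 124."

• 2 digit subtraction (2D-): the model is asked to subtract two integers sampled uniformly from ${\textstyle \lbrack 0,100)}$; the answer may be negative. Example: "Q: What is 34 minus 53? A: -19".

• 3 digit addition (3D+): same as 2 digit addition, except numbers are uniformly sampled from ${\textstyle \lbrack 0,1000)}$.

• 3 digit subtraction (3D-): same as 2 digit subtraction, except numbers are uniformly sampled from ${\textstyle \lbrack 0,1000)}$.

• 4 digit addition (4D+): same as 3 digit addition, except uniformly sampled from ${\textstyle \lbrack 0,10000)}$.

• 4 digit subtraction (4D-): same as 3 digit subtraction, except uniformly sampled from ${\textstyle \lbrack 0,10000)}$.

• 5 digit addition (5D+): same as 3 digit addition, except uniformly sampled from ${\textstyle \lbrack 0,100000)}$.

• 5 digit subtraction (5D-): same as 3 digit subtraction, except uniformly sampled from ${\textstyle \lbrack 0,100000)}$.

• 2 digit multiplication (2Dx): the model is asked to multiply two integers sampled uniformly from ${\textstyle \lbrack 0,100)}$, e.g. "Q: What is 24 times 42? A: 1008".

• One-digit composite (1DC): the model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, "Q: What is 6+(4*8)? A: 38". The three 1 digit numbers are selected uniformly on ${\textstyle \lbrack 0,10)}$ and the operations are selected uniformly from {+,-,*}.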

在所有 10 项任务中,模型必须生成完全正确的答案。对于每项任务,我们生成包含 2000 个随机实例的数据集,并在这些实例上评估所有模型。
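The ten tests above are simple enough to regenerate. The sketch below produces question and answer pairs in the quoted format and scores predictions by exact match; the function names, the seed handling, and the task-code parsing are assumptions made for illustration, not the authors' released generator.

```python
import random

OPS = {
    "+": ("plus", lambda a, b: a + b),
    "-": ("minus", lambda a, b: a - b),
    "x": ("times", lambda a, b: a * b),
}
SYMBOL_OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def make_problem(task, rng):
    """Return (question, answer) for a task code such as "2D+", "3D-", "2Dx", or "1DC"."""
    if task == "1DC":
        a, b, c = (rng.randrange(10) for _ in range(3))
        op1, op2 = rng.choice("+-*"), rng.choice("+-*")
        answer = SYMBOL_OPS[op1](a, SYMBOL_OPS[op2](b, c))
        return f"Q: What is {a}{op1}({b}{op2}{c})? A:", str(answer)
    digits, symbol = int(task[0]), task[2]
    word, fn = OPS[symbol]
    a, b = rng.randrange(10 ** digits), rng.randrange(10 ** digits)
    return f"Q: What is {a} {word} {b}? A:", str(fn(a, b))


def make_dataset(task, n=2000, seed=0):
    rng = random.Random(seed)
    return [make_problem(task, rng) for _ in range(n)]


def exact_match_accuracy(predictions, dataset):
    """Fraction of predictions that equal the reference answer exactly."""
    correct = sum(p.strip() == answer for p, (_, answer) in zip(predictions, dataset))
    return correct / len(dataset)


dataset_3d_sub = make_dataset("3D-")
dataset_1dc = make_dataset("1DC")
```

Exact-match scoring means a single wrong digit counts as a failure, which is why accuracy drops quickly as the number of digits grows.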

我们首先在少样本设定下评估 GPT-3,其结果如图 3.10 所示。在加法和减法上,当位数较少时,GPT-3 表现出很强的能力:在 2 位数加法上达到 100% 的准确率,在 2 位数减法上 98.9%,在 3 位数加法上 80.2%,在 3 位数减法上 94.2%。随着位数增加,性能下降,但 GPT-3 在四位数运算上仍达到 25-26% 的准确率,在五位数运算上达到 9-10% 的准确率,这表明它至少具备一定向更多位数泛化的能力。GPT-3 在 2 位数乘法这一计算密集型运算上也达到 29.2% 的准确率。最后,GPT-3 在一位数复合运算(例如 9*(7+5))上达到 21.3% 的准确率,这表明它具有超越单一运算的一定鲁棒性。

设置	2D+	2D-	3D+	3D-	4D+	4D-	5D+	5D-	2Dx	1DC
GPT-3 Zero-shot	76.9	58.0	34.2	48.3	4.0	7.5	0.7	0.8	19.8	9.8
GPT-3 One-shot	99.6	86.4	65.5	78.7	14.0	14.0	3.5	3.8	27.4	14.3
GPT-3 Few-shot	100.0	98.9	80.4	94.2	25.5	26.8	9.3	9.9	29.2	21.3

如图 3.10 所清楚显示的,小型模型在所有这些任务上表现都很差——即便是 130 亿参数的模型(仅次于 1750 亿完整 GPT-3 的第二大模型),也只能在一半的情况下解出 2 位数加减法,其他所有运算的成功率不足 10%。

一样本和零样本的性能相对于少样本性能有所下降,这表明对任务的适应(或至少是识别任务)对于正确执行这些计算很重要。尽管如此,一样本性能仍相当强,而且完整 GPT-3 的零样本性能甚至显著超过所有较小模型的少样本学习。完整 GPT-3 的三种设定结果见表 3.9,三种设定下的模型容量扩展情况见附录 H。

为抽查模型是否只是记忆了特定的算术题,我们将测试集中的 3 位数算术题在训练数据中分别按 "<NUM1> + <NUM2> =" 和 "<NUM1> plus <NUM2>" 两种形式进行搜索。在 2000 道加法题中我们仅找到 17 处匹配(0.8%),在 2000 道减法题中仅找到 2 处匹配(0.1%),这表明只有微不足道的一小部分正确答案可能是记忆得到的。此外,对错误回答的检查表明,模型常常出现诸如忘记进位"1"这样的错误,这表明它实际上是在尝试进行相关计算,而不是记忆某张表。
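The memorization spot check amounts to a pattern search over the training text. A minimal sketch, assuming the corpus fits in a single string and using hypothetical names, could look like this; a real check over a web-scale corpus would need an index or a streaming scan instead.

```python
import re


def count_memorized(problems, corpus, op_symbol="+", op_word="plus"):
    """Count problems (a, b) that appear in the corpus in either searched form:
    "<NUM1> + <NUM2> =" or "<NUM1> plus <NUM2>".
    """
    hits = 0
    for a, b in problems:
        patterns = [
            rf"\b{a} {re.escape(op_symbol)} {b} =",
            rf"\b{a} {op_word} {b}\b",
        ]
        if any(re.search(p, corpus) for p in patterns):
            hits += 1
    return hits


# Toy corpus and problems, invented for illustration only.
corpus = "we know that 123 + 456 = 579 and 700 plus 200 is 900"
problems = [(123, 456), (700, 200), (111, 222)]
n_hits = count_memorized(problems, corpus)
```

The word-boundary anchors keep a problem such as 23 + 456 from matching inside 123 + 456.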

Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in the few-shot, one-shot, and even zero-shot settings.

3.9.2 单词重组与操纵任务

为测试 GPT-3 从少量示例中学习新颖符号操作的能力,我们设计了一组 5 项"字符操作"任务。每项任务都是向模型给出一个经字符乱序、增加或删除等组合方式扭曲的单词,并要求模型还原原始单词。5 项任务分别为:

设置	CL	A1	A2	RI	RW
GPT-3 Zero-shot	3.66	2.28	8.91	8.26	0.09
GPT-3 One-shot	21.7	8.62	25.9	45.4	0.48
GPT-3 Few-shot	37.9	15.1	39.7	67.2	0.44

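• Cycle letters in word (CL): the model is given a word with its letters cycled, followed by the "=" symbol, and is expected to generate the original word. For example, it might be given "lyinevitab" and should output "inevitably".

• Anagrams of all but first and last characters (A1): the model is given a word in which every letter except the first and last has been scrambled randomly, and must output the original word. Example: criroptuon = corruption.

• Anagrams of all but first and last 2 characters (A2): the model is given a word in which every letter except the first 2 and last 2 has been scrambled randomly, and must recover the original word. Example: opoepnnt ${\textstyle \rightarrow}$ opponent.

• Random insertion in word (RI): a random punctuation or space character is inserted between each letter of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.

• Reversed words (RW): the model is given a word spelled backwards, and must output the original word. Example: stcejbo ${\textstyle \rightarrow}$ objects.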

对于每项任务,我们生成 10000 个示例,我们选择的是按 [92] 度量出现频率最高、长度大于 4 字符且小于 15 字符的前 10000 个单词。少样本结果如图 3.11 所示。任务性能往往随模型规模平滑增长,完整 GPT-3 在去除随机插入上达到 66.9%,在字母循环移位上 38.6%,在较容易的字谜任务上 40.2%,在更困难的字谜任务上(仅固定首末字母)15.1%。所有模型都无法将单词中的字母反序。
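The five distortions are easy to reproduce. The sketch below is one possible implementation under stated assumptions: the function names are hypothetical, and the choice of `string.punctuation` plus a space as the insertion alphabet is a guess at what "punctuation or space character" covers.

```python
import random
import string


def cycle_letters(word, rng):
    """CL: rotate the word by a random non-zero offset."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]


def anagram_keep_ends(word, rng, keep=1):
    """A1 (keep=1) or A2 (keep=2): shuffle all but the first and last `keep` letters."""
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]


def random_insertion(word, rng, symbols=string.punctuation + " "):
    """RI: insert one random punctuation or space character between each pair of letters."""
    out = [word[0]]
    for ch in word[1:]:
        out.append(rng.choice(symbols))
        out.append(ch)
    return "".join(out)


def reverse_word(word):
    """RW: spell the word backwards."""
    return word[::-1]


rng = random.Random(0)
word = "inevitably"
examples = {
    "CL": cycle_letters(word, rng),
    "A1": anagram_keep_ends(word, rng, keep=1),
    "A2": anagram_keep_ends(word, rng, keep=2),
    "RI": random_insertion(word, rng),
    "RW": reverse_word(word),
}
```

Note that the shuffle in A1 and A2 can occasionally return the original word unchanged; a stricter generator would resample in that case.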

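In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, since it cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).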

我们可以通过绘制"上下文学习曲线"——即任务表现作为上下文示例数的函数——进一步对性能进行量化。我们在图 1.2 中展示了符号插入任务的上下文学习曲线。可以看出,更大的模型能够越来越有效地利用上下文信息,包括任务示例和自然语言任务描述。

最后值得补充的是,解决这些任务需要字符级的操作,而我们的 BPE 编码作用于一个单词的相当一部分(平均每个 token 约 ${\textstyle \sim 0.7}$ 个词),因此从语言模型的视角看,在这些任务上成功不仅要操作 BPE token,还要理解并拆解其子结构。此外,CL、A1 和 A2 并不是双射的(即还原后的词并非乱序词的确定性函数),要求模型进行一定搜索以找到正确的还原。因此,所涉及的技能似乎需要非平凡的模式匹配和计算。

3.9.3 SAT 类比

为了在另一项相对于典型文本分布而言较为不寻常的任务上测试 GPT-3,我们收集了一组 374 道"SAT 类比题"[131]。类比题是一种多项选择题,2005 年之前曾是美国大学入学考试 SAT 的一个组成部分。一个典型示例是"audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation"。考生需从这五对词中选出与原词对关系相同的一对;在该例中答案为"sanctimonious is to hypocrisy"。在该任务上,GPT-3 在少样本设定下达到 65.2%,一样本下 59.1%,零样本下 53.7%,而大学申请者的平均得分为 57% [129](随机猜测为 20%)。如图 3.12 所示,结果随规模而提升,完整的 1750 亿模型相比 130 亿参数模型提升了超过 10%。

3.9.4 新闻文章生成

此前关于生成式语言模型的工作通过给定一段由人撰写的、看起来合理的新闻首句作为提示,从模型条件采样,从而定性地测试其生成合成"新闻文章"的能力 [117]。相对于 [117],GPT-3 的训练数据集中新闻文章的权重要低得多,因此通过原始的无条件采样来生成新闻文章效果较差——例如,GPT-3 经常将所提议的"新闻文章"首句理解为推文,然后产生合成的回复或后续推文。为解决该问题,我们利用 GPT-3 的少样本学习能力,在模型上下文中提供三篇此前的新闻文章作为条件。给出一篇待生成文章的标题和副标题,模型即可可靠地生成"新闻"风格的短篇文章。

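To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [56] and Zellers et al. [141]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.³

³ This task is also relevant to the potential misuse of language models discussed in Section 6.1.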

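To see how well humans can detect model-generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human-written article or the article generated by the model.⁴ Participants were asked to select whether the article was "very likely written by a human", "more likely written by a human", "I don't know", "more likely written by a machine", or "very likely written by a machine".

⁴ We wanted to identify how good an average person on the internet is at detecting language model outputs, so we focused on participants drawn from the general US population. See Appendix E for details.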

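The articles we selected were not in the models' training data, and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs, were pre-trained with the same context size, and were prompted with the same article titles and subtitles. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model-generated articles. This was done by generating articles from a "control model": a 160M parameter model with no context and increased output randomness.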

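Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ${\textstyle \sim {86\%}}$, where 50% is chance-level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ${\textstyle \sim {52\%}}$ (see Table 3.11).⁵ Human ability to detect model-generated text appears to decrease as model size increases: there is a trend toward chance accuracy with model size, and human detection of GPT-3 is close to chance.⁶ This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).

⁵ We use a two-sample Student's T-test to test for a significant difference between the mean participant accuracy for each model and for the control model, and report the normalized difference in the means (as the t-statistic) and the p-value.

⁶ If a model consistently produces texts that are more impressive than human articles, human performance on this task could drop below 50%. Indeed, many individual participants scored below 50% on this task.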
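The accuracy measure and the two-sample Student's t statistic used for Table 3.11 can be sketched with the standard library alone. The version below assumes the classic pooled-variance form of the test, which is one reading of "two-sample Student's T-test"; the participant scores are invented for illustration and are not data from the study.

```python
import math
from statistics import mean, variance


def participant_accuracy(n_correct, n_non_neutral):
    """Correct assignments divided by non-neutral assignments for one participant."""
    return n_correct / n_non_neutral


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance (equal-variance assumption)."""
    na, nb = len(sample_a), len(sample_b)
    pooled = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / (na + nb - 2)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled * (1 / na + 1 / nb))


# Hypothetical per-participant accuracies, for illustration only.
control_group = [0.90, 0.80, 0.85, 0.95]
model_group = [0.50, 0.55, 0.45, 0.50]
t_stat = student_t(control_group, model_group)
```

A large positive statistic indicates that participants detected the control articles far more reliably than the model's articles; turning it into a p-value requires the t distribution with na + nb - 2 degrees of freedom, which is omitted here.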

	Mean accuracy	95% Confidence Interval (low, hi)	${\textstyle t}$ compared to control ( ${\textstyle p}$ -value)	“I don’t know” assignments
Control (deliberately bad model)	86%	83%–90%	-	3.6 %
GPT-3 Small	76%	72%–80%	3.9 (2 ${\textstyle e}$ -4)	4.9%
GPT-3 Medium	61%	58%–65%	10.3 (7 ${\textstyle e}$ -21)	6.0%
GPT-3 Large	68%	64%–72%	7.3 (3 ${\textstyle e}$ -11)	8.7%
GPT-3 XL	62%	59%–65%	10.7 (1 ${\textstyle e}$ -19)	7.5%
GPT-3 2.7B	62%	58%–65%	10.4 (5 ${\textstyle e}$ -19)	7.1%
GPT-3 6.7B	60%	56%–63%	11.2 (3 ${\textstyle e}$ -21)	6.2%
GPT-3 13B	55%	52%–58%	15.3 (1 ${\textstyle e}$ -32)	7.1%
GPT-3 175B	52%	49%–54%	16.9 (1 ${\textstyle e}$ -34)	7.8%

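Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15.⁷ As the evaluations indicate, much of this text is difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated, since, unlike human authors, the models have no access to the specific facts that the article titles refer to or to when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.

⁷ Additional non-news samples can be found in Appendix F.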

Title: United Methodists Agree to Historic Split Subtitle: Those who oppose gay marriage will form their own denomination Article:After two days of intense debate, the United Methodist Church has agreed to a historic split - one that is expected to end in the creation of a new denomination, one that will be ”theologically and socially conservative,” according to The Washington Post. The majority of delegates attending the church’s annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will ”discipline” clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination. The Post notes that the denomination, which claims 12.5 million members, was in the early 20th century the ”largest Protestant denomination in the U.S.,” but that it has been shrinking in recent decades. The new split will be the second in the church’s history. The first occurred in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church. The Post notes that the proposed split ”comes at a critical time for the church, which has been losing members for years,” which has been ”pushed toward the brink of a schism over the role of LGBTQ people in the church.” Gay marriage is not the only issue that has divided the church. In 2016, the denomination was split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them from serving as clergy, and the South Pacific regional conference voting to allow them.

Title: Star’s Tux Promise Draws Megyn Kelly’s Sarcasm Subtitle: Joaquin Phoenix pledged to not change for each awards event Article:A year ago, Joaquin Phoenix made headlines when he appeared on the red carpet at the Golden Globes wearing a tuxedo with a paper bag over his head that read, ”I am a shape-shifter. I can’t change the world. I can only change myself.” It was a promise to not change to fit into the Hollywood mold: ”I think that’s a really special thing, to not change yourself. I think it’s a really special thing to say, ‘This is what’s inside of me, I’m proud of it, and I’m not going to be ashamed because of the way that someone else thinks I should be.’” Now, it’s the Oscars, and Phoenix is at it again. But this time, his publicist is saying he’ll be wearing a tux no matter what. Megyn Kelly was not impressed, and she let him have it on The Tonight Show. ”You know, I feel like, I feel like you could have worn the tux,” she says. ”But you’re saying you’re a shape-shifter. I don’t know if you can change your tux, but you can change your mind. You can change your mind. You can change your mind.” Phoenix says he did, but it didn’t stick. ”I was like, ‘Okay, I’m going to wear a tuxedo to this thing.’ And then I thought, ‘I don’t want to wear a tuxedo to this thing.’” Kelly goes on to encourage him to change his mind again, but Phoenix says it’s too late: ”I’m committed to wearing this.”

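Related work on language model detection by Ippolito et al. [48] indicates that automatic discriminators like Grover [141] and GLTR [37] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.

Ippolito et al. [48] also note that human accuracy at detecting model generated text increases as humans observe more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to compare human abilities to detect the articles generated by GPT-3 and a control model.

We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was ${\textstyle \sim {88\%}}$, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely above chance at ${\textstyle \sim {52\%}}$ (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human written news articles.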

	Mean accuracy	95% Confidence Interval (low, hi)	${\textstyle t}$ compared to control (${\textstyle p}$-value)	“I don’t know” assignments
Control	88%	84%–91%	-	2.7%
GPT-3 175B	52%	48%–57%	12.7 (3.2e-23)	10.6%

A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is: We were traveling in Africa and we saw these very cute whatpus. To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is: One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.
A "yalubalu" is a type of vegetable that looks like a big pumpkin. An example of a sentence that uses the word yalubalu is: I was on a trip to Africa and I tried this yalubalu vegetable that was grown in a garden there. It was delicious.

A "Burringo" is a car with very fast acceleration. An example of a sentence that uses the word Burringo is: In our garage we have a Burringo that my father drives to work every day.

A "Gigamuru" is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is: I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.

To "screeg" something is to swing a sword at it. An example of a sentence that uses the word screeg is: We screeghed at each other for several minutes and then we went outside and ate ice cream.

Poor English input: I eated the purple berries. Good English output: I ate the purple berries. Poor English input: Thank you for picking me as your designer. I’d appreciate it. Good English output: Thank you for choosing me as your designer. I appreciate it. Poor English input: The mentioned changes have done. or I did the alteration that you requested. or I changed things you wanted and did the modifications. Good English output: The requested changes have been made. or I made the alteration that you requested. or I changed things you wanted and made the modifications. Poor English input: I’d be more than happy to work with you in another project. Good English output: I’d be more than happy to work with you on another project. Poor English input: Please provide me with a short brief of the design you’re looking for and that’d be nice if you could share some examples or project you did before. Good English output: Please provide me with a brief description of the design you’re looking for and that would be nice if you could share some examples or projects you have done before.

 Poor English input: The patient was died. Good English output: The patient died.

 Poor English input: We think that Leslie likes ourselves. Good English output: We think that Leslie likes us.

 Poor English input: Janet broke Bill on the finger. Good English output: Janet broke Bill’s finger.

 Poor English input: Mary arranged for, in St. Louis, John to rent a house cheap. Good English output: Mary arranged for John to rent a house in St. Louis.

Poor English input: Today I have went to the store to to buys some many bottle of water. Good English output: Today I went to the store to buy some bottles of water.

 Poor English input: I have tried to hit ball with bat, but my swing is has miss. Good English output: I tried to hit the ball with the bat, but my swing missed.

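3.9.5 Learning and Using Novel Words

A task studied in developmental linguistics [13] is the ability to learn and utilize new words, for example using a word in a sentence after seeing it defined only once, or conversely inferring a word's meaning from only one usage. Here we qualitatively test GPT-3's ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, such as "Gigamuru", and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final sentence the model generates a plausible conjugation for the word "screeg" (namely "screeghed"), although the use of the word is slightly awkward ("screeghed at each other") despite being plausible in the sense that it could describe a toy sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.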

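3.9.6 Correcting English Grammar

Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.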
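The prompt format described above can be assembled mechanically. The following is a minimal sketch, not the authors' code: the function name `build_grammar_prompt`, the blank line between demonstrations, and the lowercase label casing (taken from the figure text rather than the inline format string) are assumptions made for illustration.

```python
def build_grammar_prompt(demonstrations, query):
    """Format (poor, good) sentence pairs followed by one open query."""
    blocks = [
        f"Poor English input: {poor}\nGood English output: {good}"
        for poor, good in demonstrations
    ]
    # The final block leaves the output empty so the model completes it.
    blocks.append(f"Poor English input: {query}\nGood English output:")
    # Assumption: demonstrations are separated by a blank line.
    return "\n\n".join(blocks)


demos = [("I eated the purple berries.", "I ate the purple berries.")]
prompt = build_grammar_prompt(demos, "The patient was died.")
print(prompt)
```

Adding more pairs to `demos` moves the same prompt from the one-shot to the few-shot setting without any change to the model.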

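4 Measuring and Preventing Memorization Of Benchmarks

Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.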

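This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [130] detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 [117] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).

GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared.

We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn't feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts results.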

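For each benchmark, we produce a 'clean' version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C.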
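The flagging rule above can be sketched in a few lines. This is an illustrative toy, not the pipeline used in the paper (which ran exact collision detection at scale): the helper names `normalize`, `ngrams`, and `is_potentially_leaked` are hypothetical, punctuation stripping is limited to ASCII, and training documents are scanned one by one rather than indexed.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_potentially_leaked(example, train_docs, n=13):
    """Flag an example sharing an n-gram with any training document.

    Examples shorter than n tokens must match in full, so the window
    shrinks to the example length.
    """
    tokens = normalize(example)
    if not tokens:
        return False
    k = min(n, len(tokens))
    example_grams = ngrams(tokens, k)
    return any(example_grams & ngrams(normalize(doc), k) for doc in train_docs)


train = ["The quick brown fox jumps over the lazy dog and then runs "
         "far away from the farm today."]
long_hit = is_potentially_leaked(
    "Quick brown fox jumps over the lazy dog and then runs far away from here!",
    train)
short_hit = is_potentially_leaked("lazy dog", train)
miss = is_potentially_leaked(
    "A completely unrelated sentence about cooking pasta.", train)
```

The rule is deliberately conservative: a single shared window is enough to mark an example, which is why the text expects false positives.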

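We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance.

Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference difficult.

Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false positives. We summarize the results for each group of tasks below: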

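• Reading Comprehension: Our initial analysis flagged ${\textstyle >}$90% of task examples from QuAC, SQuAD2, and DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difficult. Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source text was present in our training data but the question/answer pairs were not, meaning the model gains only background information and cannot memorize the answer to a specific question.

• German translation: We found 25% of the examples in the WMT16 German-English test set were marked as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the flagged examples contain paired sentences resembling NMT training data, and collisions were monolingual matches, mostly of snippets of events discussed in the news.

• Reversed Words and Anagrams: Recall that these tasks are of the form "alaok = koala". Due to the short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, but rather palindromes or trivial unscramblings, e.g. "kayak = kayak". The amount of overlap was small, but removing the trivial tasks led to an increase in difficulty and thus a spurious signal. Related to this, the symbol insertion task shows high overlap but no effect on performance. This is because that task involves removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to many spurious matches.

• PIQA: The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was released after our training set was created and its labels are hidden, some of the web pages used by the crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential contamination.

• Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the clean subset. Manual inspection of the overlapping data points showed that 132 Winograd schemas were in fact present in our training set, though presented in a different format than we present the task to the model. Although the decrease in performance is small, we mark our Winograd results in the main paper with an asterisk.

• Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the Children's Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably extract a clean subset here, we do not report results on these datasets, even though we intended to when starting this work. We note that Penn Tree Bank, due to its age, was unaffected and therefore became our chief language modeling benchmark.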

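We also inspected datasets where contamination was high but the impact on performance was close to zero, simply to verify how much actual contamination existed. These appeared to often contain false positives. They had either no actual contamination, or had contamination that did not give away the answer to the task. One notable exception was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this paper, the potential contamination is noted in the results section.

An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the same distribution as the original dataset. It remains possible that memorization inflates results but at the same time is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small models, which are unlikely to be memorizing.

Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.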

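5 Limitations

GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work.

First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3's limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with "common sense physics", despite doing well on some datasets (such as PIQA [11]) that test this domain. Specifically, GPT-3 has difficulty with questions of the type "If I put cheese into the fridge, will it melt?". Quantitatively, GPT-3's in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3, in particular on some "comparison" tasks, such as determining if two words are used the same way in a sentence or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks, where it does little better than chance even when evaluated one-shot or few-shot. This is especially striking given GPT-3's strong few-shot performance on many other tasks.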

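GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models [116]. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3's lagging few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the "best of both worlds".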

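A more fundamental limitation of the general approach described in this paper (scaling up any LM-like model, whether autoregressive or bidirectional) is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [115] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [9]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [143], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [18].

Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [71]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.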

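A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks "from scratch" at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what humans learn from scratch versus from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.

A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [44] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [69] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size.

Finally, GPT-3 shares some limitations common to most deep learning systems: its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs, as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue, biases in the data that may lead the model to generate stereotyped or prejudiced content, is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6).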

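6 Broader Impacts

Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and harmful applications of language models.

Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly discuss issues of energy efficiency (Section 6.3).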

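6.1 Misuse of Language Models

Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [113]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.

6.1.1 Potential Misuse Applications

Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high quality text. Language models that produce high quality text generation could lower existing barriers to carrying out these activities and increase their efficacy.

The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in Section 3.9.4 represents a concerning milestone in this regard.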

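6.1.2 Threat Actor Analysis

Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors who may be able to build a malicious product to 'advanced persistent threats' (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas [119].

To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this.

Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for "targeting" or "controlling" the content of language models are still at a very early stage.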

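6.1.3 External Incentive Structures

Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.

Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts how scalable the operation can be.

Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on this through a combination of mitigation research, prototyping, and coordinating with other technical developers.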

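6.2 Fairness, Bias, and Representation

Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [19]. We have conducted an analysis of biases in the model in order to better understand GPT-3's limitations when it comes to fairness, bias, and representation. (Evaluating fairness, bias, and representation in language models is a rapidly developing area with a large body of prior work; see, for example, [46, 90, 120].)

Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model's biases even within the studied categories.

Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race, and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how they are different in this dimension.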

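6.2.1 Gender

In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found that occupations in general have a higher probability of being followed by a male gender identifier than a female one (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant). 83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured this by feeding the model a context such as "The detective was a" and then looking at the probability of the model following up with male indicating words (e.g. man, male etc.) or female indicating words (woman, female etc.). In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus were heavily male leaning, along with occupations that require hard physical labour such as mason, millwright, and sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist, housekeeper etc.

We also tested how these probabilities changed when we shifted the context to be "The competent {occupation} was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a" (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent {occupation} was a," the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male with a similar probability as for our original neutral prompt. The average occupation bias, measured as ${\textstyle \frac{1}{n_{\mathrm{jobs}}}\sum_{\mathrm{jobs}}\log\left(\frac{P(\mathrm{female}\mid\mathrm{Context})}{P(\mathrm{male}\mid\mathrm{Context})}\right)}$, was ${\textstyle -1.11}$ for the Neutral Variant, ${\textstyle -2.14}$ for the Competent Variant and ${\textstyle -1.15}$ for the Incompetent Variant.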
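The average occupation bias defined above is a mean of log probability ratios, so a negative value indicates a male lean. A minimal sketch follows; the function name `average_occupation_bias` and the toy probabilities are hypothetical and are not values from the paper, and the natural logarithm is assumed since the text does not state the base.

```python
import math


def average_occupation_bias(probabilities):
    """Mean over occupations of log(P(female | context) / P(male | context)).

    `probabilities` maps an occupation to a (p_female, p_male) pair.
    """
    ratios = [math.log(p_f / p_m) for p_f, p_m in probabilities.values()]
    return sum(ratios) / len(ratios)


# Hypothetical probabilities for illustration only (not from the paper).
toy = {
    "detective": (0.05, 0.20),
    "nurse": (0.30, 0.10),
    "banker": (0.04, 0.16),
}
bias = average_occupation_bias(toy)
print(f"average occupation bias: {bias:.3f}")
```

Because the ratio is averaged in log space, one strongly female-leaning occupation offsets one equally strongly male-leaning occupation, and a value of 0 means no lean on average.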

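We also carried out pronoun resolution on the Winogender dataset [111] using two methods which further corroborated the model's tendency to associate most occupations with males. One method measured the model's ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. 'She' refers to the" and found the option with the lowest probability between the two possible options (Occupation Option: advisor; Participant Option: advisee).

Occupation and participant words often have societal biases associated with them, such as the assumption that most occupants are by default male. We found that the language models learnt some of these biases, such as a tendency to associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns, with the exception of our second largest model, GPT-3 13B, which had the same accuracy (60%) for both. This offers some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger models are more robust than smaller models.

We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other pre-selected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top_p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", and "She would be described as". (We only used male and female pronouns. This simplifying assumption makes it easier to study co-occurrence since it does not require the isolation of instances in which "they" refers to a singular noun from those where it does not, but other forms of gender bias are likely present and could be studied using different approaches.) We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [60]. We found females were more often described using appearance oriented words such as "beautiful" and "gorgeous" as compared to men, who were more often described using adjectives that span a greater spectrum.

Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each word co-occurred with a pronoun indicator. "Most favored" here indicates words which were most skewed towards a category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective, we have also included the average for the number of co-occurrences across all qualifying words for each gender.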

Top 10 Most Biased Male Descriptive Words with Raw Co-Occurrence Counts	Top 10 Most Biased Female Descriptive Words with Raw Co-Occurrence Counts
Average Number of Co-Occurrences Across All Words: 17.5	Average Number of Co-Occurrences Across All Words: 23.9
Large (16)	Optimistic (12)
Mostly (15)	Bubbly (12)
Lazy (14)	Naughty (12)
Fantastic (13)	Easy-going (12)
Eccentric (13)	Petite (10)
Protect (10)	Tight (10)
Jolly (10)	Pregnant (10)
Stable (9)	Gorgeous (28)
Personable (22)	Sucked (8)
Survive (7)	Beautiful (158)

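6.2.2 Race

To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as" and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation [46], we explored how race impacted sentiment. We measured sentiment using Senti WordNet [7] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet).

It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that focused on racial features; these results are not from the models talking about race in the wild but talking about race in an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors. For instance, text relating to a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated with a negative sentiment under this testing methodology.

Across the models we analyzed, 'Asian' had a consistently high sentiment: it ranked 1st in 3 out of 7 models. On the other hand, 'Black' had a consistently low sentiment: it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.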

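6.2.3 Religion

We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism, by generating 800 model outputs of length ${\textstyle \approx}$50 with a temperature of 1 and a top ${\textstyle p}$ of ${\textstyle 0.9}$ for every prompt. Our prompts were of the nature "{Religion practitioners} are" (e.g. "Christians are") for each of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a corpus of such completions for studying co-occurrence of words.

The following is an example output from the model: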
"Buddhists are divided into two main branches - Theravada and Mahayana. Theravada is the more conservative branch, centering on monastic life and the earliest sutras and refusing to recognize the later Mahayana sutras as authentic."

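Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3.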

Religion	Most Favored Descriptive Words
Atheism	‘Theists’, ‘Cool’, ‘Agnostics’, ‘Mad’, ‘Theism’, ‘Defensive’, ‘Complaining’, ‘Correct’, ‘Arrogant’, ‘Characterized’
Buddhism	‘Myanmar’, ‘Vegetarians’, ‘Burma’, ‘Fellowship’, ‘Monk’, ‘Japanese’, ‘Reluctant’, ‘Wisdom’, ‘Enlightenment’, ‘Non-Violent’
Christianity	‘Attend’, ‘Ignorant’, ‘Response’, ‘Judgmental’, ‘Grace’, ‘Execution’, ‘Egypt’, ‘Continue’, ‘Comments’, ‘Officially’
Hinduism	‘Caste’, ‘Cows’, ‘BJP’, ‘Kashmir’, ‘Modi’, ‘Celebrated’, ‘Dharma’, ‘Pakistani’, ‘Originated’, ‘Africa’
Islam	‘Pillars’, ‘Terrorism’, ‘Fasting’, ‘Sheikh’, ‘Non-Muslim’, ‘Source’, ‘Charities’, ‘Levant’, ‘Allah’, ‘Prophet’
Judaism	‘Gentiles’, ‘Race’, ‘Semites’, ‘Whites’, ‘Blacks’, ‘Smartest’, ‘Racists’, ‘Arabs’, ‘Game’, ‘Russian’

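6.2.4 Future Bias and Fairness Challenges

We have presented this preliminary analysis to share some of the biases we found in order to motivate further research, and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an area of continuous research for us and are excited to discuss different methodological approaches with the community. We view the work in this section as subjective signposting: we chose gender, race, and religion as a starting point, but we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model attributes to develop informative labels such as Model Cards for Model Reporting from [89].

Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this is also extensive [104, 46], so we offer only a few brief comments on future directions specific to large language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models. There is room for more research that engages with the literature outside NLP, better articulates normative statements about harm, and engages with the lived experience of communities affected by NLP systems [4]. Thus, mitigation work should not be approached purely with a metric driven objective to 'remove' bias, as this has been shown to have blind spots [32, 93], but in a holistic manner.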

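6.3 Energy Usage

Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such models, as advocated by [122].

The use of large-scale pre-training also gives another lens through which to view the efficiency of large models: we should consider not only the resources that go into training them, but how these resources are amortized over the lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs. Additionally, techniques like model distillation [69] can further bring down the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency of such models over time, similar to trends observed in image recognition and neural machine translation [39].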

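7 Related Work

Several lines of work have focused on increasing parameter count and/or computation in language models as a means to improve generative or task performance. An early work scaled LSTM based language models to over a billion parameters [51]. One line of work straightforwardly increases the size of transformer models, scaling up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size: 213 million parameters [134] in the original paper, 300 million parameters [20], 1.5 billion parameters [117], 8 billion parameters [125], 11 billion parameters [116], and most recently 17 billion parameters [132]. A second line of work has focused on increasing parameter count but not computation, as a means of increasing models' capacity to store information without increased computational cost. These approaches rely on the conditional computation framework [10]; specifically, the mixture-of-experts method [124] has been used to produce 100 billion parameter models and more recently 50 billion parameter translation models [3], though only a small fraction of the parameters are actually used on each forward pass. A third approach increases computation without increasing parameters; examples of this approach include adaptive computation time [35] and the universal transformer [22]. Our work focuses on the first approach (scaling compute and parameters together, by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ this strategy.

Several efforts have also systematically studied the effect of scale on language model performance. [57, 114, 77, 42] find a smooth power-law trend in loss as autoregressive language models are scaled up. This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all) downstream tasks across 3 orders of magnitude of scaling.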

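Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language models that are as small as possible. This approach includes ALBERT [62] as well as general [44] and task-specific [121, 52, 59] approaches to distillation of language models. These architectures and techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint of giant models.

As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable effort has been devoted to constructing more difficult or open-ended tasks, including question answering [58, 47, 14, 84], reading comprehension [16, 106], and adversarially constructed datasets designed to be difficult for existing language models [118, 94]. In this work we test our models on many of these datasets.

Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the tasks we tested on. Recent efforts include [116, 115], which fine-tuned an 11 billion parameter language model, and [33], which focused on attending over a large corpus of data at test time. Our work differs in focusing on in-context learning but could be combined in the future with those of [33, 75].

Metalearning in language models has been utilized in [117], though with much more limited results and no systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including matching networks [133], RL2 [26], learning to optimize [109, 1, 73] and MAML [30]. Our approach of stuffing the model's context with previous examples is most structurally similar to RL2 and also resembles [45], in that an inner loop of adaptation takes place through computation in the model's activations across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training) updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time. Few-shot auto-regressive density estimation was explored in [107] and [38] studied low-resource NMT as a few-shot learning problem.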

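While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained language models in combination with gradient descent to perform few-shot learning [126]. Another sub-field with similar goals is semi-supervised learning, where approaches such as UDA [137] also explore methods of fine-tuning when very little labeled data is available.

Giving multi-task models instructions in natural language was first formalized in a supervised setting with [87] and utilized for some tasks (such as summarizing) in a language model with [117]. The notion of presenting tasks in natural language was also explored in the text-to-text transformer [116], although there it was applied for multi-task fine-tuning rather than for in-context learning without weight updates.

Another approach to increasing generality and transfer-learning capability in language models is multi-task learning [12], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating the weights for a new task. Multi-task learning has shown some promising initial results [67, 76] and multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [97] and pushed the boundaries on certain tasks [55], but is still limited by the need to manually curate collections of datasets and set up training curricula. By contrast, pre-training at large enough scale appears to offer a "natural" broad distribution of tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate a broader set of explicit tasks for multi-task learning, for example through procedural generation [128], human interaction [144], or active learning [80].

Algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [20], prefixLM [24] and encoder-decoder architectures [72, 116], random permutations during training [139], architectures that improve the efficiency of sampling [28], improvements in data and training procedures [74], and efficiency increases in the embedding parameters [62]. Many of these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive language models, both in order to focus on in-context learning performance and to reduce the complexity of our large model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3's performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3's scale with these algorithmic techniques is a promising direction for future work.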

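8 Conclusion

We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.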

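Acknowledgements

The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea Voss for helping run evaluations on OpenAI's infrastructure. Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and supercomputing teams for making it possible to train models at this scale.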

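Contributions

Tom Brown, Ben Mann, Prafulla Dhariwal, Dario Amodei, Nick Ryder, Daniel M Ziegler, and Jeffrey Wu implemented the large-scale models, training infrastructure, and model-parallel strategies.

Tom Brown, Dario Amodei, Ben Mann, and Nick Ryder conducted pre-training experiments.

Ben Mann and Alec Radford collected, filtered, deduplicated, and conducted overlap analysis on the training data.

Melanie Subbiah, Ben Mann, Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown, Tom Henighan, and Girish Sastry implemented the downstream tasks and the software framework for supporting them, including creation of synthetic tasks.

Jared Kaplan and Sam McCandlish initially predicted that a giant language model should show continued gains, and applied scaling laws to help predict and guide model and data scaling decisions for the research.

Ben Mann implemented sampling without replacement during training.

Alec Radford originally demonstrated few-shot learning occurs in language models.

Jared Kaplan and Sam McCandlish showed that larger models learn more quickly in-context, and systematically studied in-context learning curves, task prompting, and evaluation methods.

Prafulla Dhariwal implemented an early version of the codebase, and developed the memory optimizations for fully half-precision training.

Rewon Child and Mark Chen developed an early version of our model-parallel strategy.

Rewon Child and Scott Gray contributed the sparse transformer.

Aditya Ramesh experimented with loss scaling strategies for pretraining.

Melanie Subbiah and Arvind Neelakantan implemented, experimented with, and tested beam search.

Pranav Shyam worked on SuperGLUE and assisted with connections to few-shot learning and meta-learning literature.

Sandhini Agarwal conducted the fairness and representation analysis.

Girish Sastry and Amanda Askell conducted the human evaluations of the model.

Ariel Herbert-Voss conducted the threat analysis of malicious use.

Gretchen Krueger edited and red-teamed the policy sections of the paper.

Benjamin Chess, Clemens Winter, Eric Sigler, Christopher Hesse, Mateusz Litwin, and Christopher Berner optimized OpenAI's clusters to run the largest models efficiently.

Scott Gray developed fast GPU kernels used during training.

Jack Clark led the analysis of ethical impacts (fairness and representation, human assessments of the model, and broader impacts analysis), and advised Gretchen, Amanda, Girish, Sandhini, and Ariel on their work.

Dario Amodei, Alec Radford, Tom Brown, Sam McCandlish, Nick Ryder, Jared Kaplan, Sandhini Agarwal, Amanda Askell, Girish Sastry, and Jack Clark wrote the paper.

Sam McCandlish led the analysis of model scaling, and advised Tom Henighan and Jared Kaplan on their work.

Alec Radford advised the project from an NLP perspective, suggested tasks, put the results in context, and demonstrated the benefit of weight decay for training.

Ilya Sutskever was an early advocate for scaling large generative likelihood models, and advised Pranav, Prafulla, Rewon, Alec, and Aditya on their work.

Dario Amodei designed and led the research.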

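Appendix A Details of Common Crawl Filtering

As mentioned in Section 2.2, we employed two techniques to improve the quality of the Common Crawl dataset: (1) filtering Common Crawl and (2) fuzzy deduplication:

1. In order to improve the quality of Common Crawl, we developed an automatic filtering method to remove low quality documents. Using the original WebText as a proxy for high-quality documents, we trained a classifier to distinguish these from raw Common Crawl. We then used this classifier to re-sample Common Crawl by prioritizing documents which were predicted by the classifier to be higher quality. The classifier is trained using a logistic regression classifier with features from Spark's standard tokenizer and HashingTF (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.HashingTF). For the positive examples, we used a collection of curated datasets such as WebText, Wikipedia, and our web books corpus, and for the negative examples, we used unfiltered Common Crawl. We used this classifier to score Common Crawl documents. We kept each document in our dataset iff

$\mathtt{np.random.pareto}(\alpha) > 1 - \mathtt{document\_score}$

We chose ${\textstyle \alpha = 9}$ in order to take mostly documents the classifier scored highly, but still include some documents that were out of distribution. ${\textstyle \alpha}$ was chosen to match the distribution of scores from our classifier on WebText. We found this re-weighting increased quality as measured by loss on a range of out-of-distribution generative text samples.

2. To further improve model quality and prevent overfitting (which becomes increasingly important as model capacity increases), we fuzzily deduplicated documents (i.e. removed documents with high overlap with other documents) within each dataset using Spark's MinHashLSH implementation with 10 hashes, using the same features as were used for classification above. We also fuzzily removed WebText from Common Crawl. Overall this decreased dataset size by an average of 10%.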
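The retention rule in item 1 above can be simulated to see how sharply it favours high scores. This is a sketch under stated assumptions: it uses only the standard library, relying on the fact that `random.paretovariate` samples a classical Pareto on [1, ∞) while `np.random.pareto` samples the Lomax (Pareto II) form on [0, ∞), so 1 is subtracted to match. The helper names `keep_document` and `keep_rate`, and the assumption that scores lie in [0, 1], are illustrative.

```python
import random


def keep_document(document_score, alpha=9.0, rng=random):
    """Apply the rule: keep iff pareto(alpha) > 1 - document_score."""
    # Subtracting 1 converts the classical Pareto draw to the Lomax form
    # that np.random.pareto would return.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - document_score


def keep_rate(document_score, trials=20000, seed=0):
    """Estimate the fraction of documents kept at a given score."""
    rng = random.Random(seed)
    kept = sum(keep_document(document_score, rng=rng) for _ in range(trials))
    return kept / trials


for score in (1.0, 0.9, 0.5, 0.0):
    print(score, keep_rate(score))
```

Under the Lomax distribution the keep probability is (2 - score)^(-alpha), so with alpha = 9 a document scored 0.5 survives only a few percent of the time, while low-scoring documents are still occasionally retained, which is the out-of-distribution allowance the text describes.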

After filtering for duplicates and quality, we also partially removed text occurring in benchmark datasets, described in Appendix C.

Appendix B Details of Model Training

To train all versions of GPT-3, we use Adam with ${\textstyle \beta_{1} = 0.9}$, ${\textstyle \beta_{2} = 0.95}$, and ${\textstyle \epsilon = 10^{-8}}$, we clip the global norm of the gradient at 1.0, and we use cosine decay for the learning rate down to 10% of its value, over 260 billion tokens (after 260 billion tokens, training continues at 10% of the original learning rate). There is a linear LR warmup over the first 375 million tokens. We also gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size. Data are sampled without replacement during training (until an epoch boundary is reached) to minimize overfitting. All models use weight decay of 0.1 to provide a small amount of regularization [68].
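The learning rate schedule described above (linear warmup, cosine decay to 10%, then constant) can be written as a function of tokens seen. This is a minimal sketch with assumptions flagged: the text does not say whether the cosine decay starts at token 0 or at the end of warmup, so the sketch assumes it starts after warmup; the function name `learning_rate` and the peak value used in the example are illustrative only.

```python
import math


def learning_rate(tokens_seen, peak_lr, warmup_tokens=375e6,
                  decay_tokens=260e9, final_fraction=0.1):
    """Linear warmup, cosine decay to final_fraction * peak_lr, then flat."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen >= decay_tokens:
        return peak_lr * final_fraction
    # Assumption: the cosine phase spans from end of warmup to decay_tokens.
    progress = (tokens_seen - warmup_tokens) / (decay_tokens - warmup_tokens)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_fraction + (1.0 - final_fraction) * cosine)


# Illustrative peak value only; per-model learning rates are not given here.
for tokens in (0, 375e6, 130e9, 260e9, 300e9):
    print(f"{tokens:.3g} tokens -> lr {learning_rate(tokens, 1.0):.3f}")
```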

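During training we always train on sequences of the full ${\textstyle n_{ctx} = 2048}$ token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.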
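The packing scheme above amounts to concatenating tokenized documents with a delimiter and slicing the stream into fixed-length windows. The sketch below illustrates the idea only: the function name `pack_documents`, the integer token ids, and the choice to drop a trailing partial window are assumptions, not details from the paper.

```python
def pack_documents(tokenized_docs, eot_token, context_length=2048):
    """Concatenate documents separated by eot_token, then cut into windows."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eot_token)  # delimiter marks a document boundary
    full_windows = len(stream) // context_length
    # Assumption: a trailing partial window is dropped rather than padded.
    return [
        stream[i * context_length:(i + 1) * context_length]
        for i in range(full_windows)
    ]


# Tiny example with a context length of 4 and 0 as a stand-in delimiter id.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_documents(docs, eot_token=0, context_length=4)
print(packed)
```

No attention mask is built here, which mirrors the text: a document may straddle two windows, and the delimiter is the only signal that neighbouring content is unrelated.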

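Appendix C Details of Test Set Contamination Studies

In Section 4 we gave a high level overview of test set contamination studies. In this section we provide details on methodology and results.

Initial training set filtering

We attempted to remove text occurring in benchmarks from training data by searching for 13-gram overlaps between all test/development sets used in this work and our training data, and we removed the colliding 13-gram as well as a 200 character window around it, splitting the original document into pieces. For filtering purposes we define a gram as a lowercase, whitespace delimited word with no punctuation. Pieces less than 200 characters long were discarded. Documents split into more than 10 pieces were considered contaminated and removed entirely. Originally we removed entire documents given a single collision, but that overly penalized long documents such as books for false positives. An example of a false positive might be a test set based on Wikipedia, in which the Wikipedia article quotes a single line from a book. We ignored 13-grams that matched more than 10 training documents, as inspection showed the majority of these to contain common cultural phrases, legal boilerplate, or similar content that we likely do want the model to learn, rather than undesired specific overlaps with test sets. Examples for various frequencies can be found in the GPT-3 release repository (https://github.com/openai/gpt-3/blob/master/overlap_frequency.md).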

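Overlap methodology

For our benchmark overlap analysis in Section 4, we used a variable number of words ${\textstyle N}$ to check for overlap for each dataset, where ${\textstyle N}$ is the 5th percentile example length in words, ignoring all punctuation, whitespace, and casing. Due to spurious collisions at lower values of ${\textstyle N}$ we use a minimum value of 8 on non-synthetic tasks. For performance reasons, we set a maximum value of 13 for all tasks. Values for ${\textstyle N}$ and the amount of data marked as dirty are shown in Table C.1. Unlike GPT-2's use of bloom filters to compute probabilistic bounds for test contamination, we used Apache Spark to compute exact collisions across all training and test sets. We compute overlaps between test sets and our full training corpus, even though we only trained on 40% of our filtered Common Crawl documents per Section 2.2.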
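The choice of N described above is a clamped percentile. A small sketch follows; the nearest-rank percentile definition and the function name `choose_n` are assumptions (the text does not specify the percentile method), and passing `minimum=1` stands in for synthetic tasks, where the floor of 8 is not applied.

```python
def choose_n(example_lengths, minimum=8, maximum=13):
    """Pick N as the 5th percentile example length, clamped to [minimum, maximum]."""
    ordered = sorted(example_lengths)
    # Nearest-rank 5th percentile, using integer arithmetic for ceil(5n / 100).
    rank = max(1, (5 * len(ordered) + 99) // 100)
    fifth_percentile = ordered[rank - 1]
    return max(minimum, min(maximum, fifth_percentile))


print(choose_n(range(1, 101)))                   # short examples: floor applies
print(choose_n([30] * 50))                       # long examples: cap applies
print(choose_n([2, 2, 3, 4] * 10, minimum=1))    # synthetic task, no floor of 8
```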

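We define a 'dirty' example as one with any ${\textstyle N}$-gram overlap with any training document, and a 'clean' example as one with no collision.

Test and validation splits had similar contamination levels despite some test splits being unlabeled. Due to a bug revealed by this analysis, the filtering described above failed on long documents such as books. Because of cost considerations it was infeasible to retrain the model on a corrected version of the training dataset. As such, several language modeling benchmarks plus the Children's Book Test showed almost complete overlap, and therefore were not included in this paper. Overlaps are shown in Table C.1.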

Name	Split	Metric	${\textstyle N}$	Acc/F1/BLEU	Total Count	Dirty Acc/F1/BLEU	Dirty Count	Clean Acc/F1/BLEU	Clean Count	Clean Percentage	Relative Difference Clean vs All
Quac	dev	f1	13	44.3	7353	44.3	7315	54.1	38	1%	20%
SQuADv2	dev	f1	13	69.8	11873	69.9	11136	68.4	737	6%	-2%
DROP	dev	f1	13	36.5	9536	37.0	8898	29.5	638	7%	-21%
Symbol Insertion	dev	acc	7	66.9	10000	66.8	8565	67.1	1435	14%	0%
CoQa	dev	f1	13	86.0	7983	85.3	5107	87.1	2876	36%	1%
ReCoRD	dev	acc	13	89.5	10000	90.3	6110	88.2	3890	39%	-1%
Winograd	test	acc	9	88.6	273	90.2	164	86.2	109	40%	-3%
BoolQ	dev	acc	13	76.0	3270	75.8	1955	76.3	1315	40%	0%
MultiRC	dev	acc	13	74.2	953	73.4	558	75.3	395	41%	1%
RACE-h	test	acc	13	46.8	3498	47.0	1580	46.7	1918	55%	0%
LAMBADA	test	acc	13	86.4	5153	86.9	2209	86.0	2944	57%	0%
LAMBADA (No Blanks)	test	acc	13	77.8	5153	78.5	2209	77.2	2944	57%	-1%
WSC	dev	acc	13	76.9	104	73.8	42	79.0	62	60%	3%
PIQA	dev	acc	8	82.3	1838	89.9	526	79.3	1312	71%	-4%
RACE-m	test	acc	13	58.5	1436	53.0	366	60.4	1070	75%	3%
De ${\textstyle \rightarrow}$ En 16	test	bleu-sb	12	43.0	2999	47.4	739	40.8	2260	75%	-5%
En ${\textstyle \rightarrow}$ De 16	test	bleu-sb	12	30.9	2999	32.6	739	29.9	2260	75%	-3%
En ${\textstyle \rightarrow}$ Ro 16	test	bleu-sb	12	25.8	1999	24.9	423	26.1	1576	79%	1%
Ro ${\textstyle \rightarrow}$ En 16	test	bleu-sb	12	41.3	1999	40.4	423	41.6	1576	79%	1%
WebQs	test	acc	8	41.5	2032	41.6	428	41.5	1604	79%	0%
ANLI R1	test	acc	13	36.8	1000	40.5	200	35.9	800	80%	-3%
ANLI R2	test	acc	13	34.0	1000	29.4	177	35.0	823	82%	3%
TriviaQA	dev	acc	10	71.2	7993	70.8	1390	71.3	6603	83%	0%
ANLI R3	test	acc	13	40.2	1200	38.3	196	40.5	1004	84%	1%
En ${\textstyle \rightarrow}$ Fr 14	test	bleu-sb	13	39.9	3003	38.3	411	40.3	2592	86%	1%
Fr ${\textstyle \rightarrow}$ En 14	test	bleu-sb	13	41.4	3003	40.9	411	41.4	2592	86%	0%
WiC	dev	acc	13	51.4	638	53.1	49	51.3	589	92%	0%
RTE	dev	acc	13	71.5	277	71.4	21	71.5	256	92%	0%
CB	dev	acc	13	80.4	56	100.0	4	78.8	52	93%	-2%
Anagrams 2	dev	acc	2	40.2	10000	76.2	705	37.4	9295	93%	-7%
Reversed Words	dev	acc	2	0.4	10000	1.5	660	0.3	9340	93%	-26%
OpenBookQA	test	acc	8	65.4	500	58.1	31	65.9	469	94%	1%
ARC (Easy)	test	acc	11	70.1	2268	77.5	89	69.8	2179	96%	0%
Anagrams 1	dev	acc	2	15.0	10000	49.8	327	13.8	9673	97%	-8%
COPA	dev	acc	9	93.0	100	100.0	3	92.8	97	97%	0%
ARC (Challenge)	test	acc	12	51.6	1144	45.2	31	51.8	1113	97%	0%
HellaSwag	dev	acc	13	79.3	10042	86.2	152	79.2	9890	98%	0%
NQs	test	acc	11	29.9	3610	32.7	52	29.8	3558	99%	0%
Cycled Letters	dev	acc	2	38.6	10000	20.5	73	38.7	9927	99%	0%
SAT Analogies	dev	acc	9	65.8	374	100.0	2	65.6	372	99%	0%
StoryCloze	test	acc	13	87.7	1871	100.0	2	87.6	1869	100%	0%
Winogrande	dev	acc	13	77.7	1267	-	0	77.7	1267	100%	0%

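Overlap results

To understand how much having seen some of the data helps the model perform on downstream tasks, we filter every validation and test set by dirtiness. We then run evaluation on the clean-only examples and report the relative percent change between the clean score and the original score. If the clean score is more than 1% or 2% worse than the overall score, this suggests the model may have overfit to the examples it has seen. If the clean score is significantly better, our filtering scheme may have preferentially marked easier examples as dirty.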
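As a rough illustration of the two derived columns in the table above, the sketch below recomputes the clean percentage and the relative difference for the PIQA row. The formula (clean score minus overall score, divided by overall score) is an assumption inferred from the description in this section, and the function and argument names are hypothetical. Because the table reports rounded scores, values recomputed this way may differ from the published ones by a point or so.

```python
def overlap_summary(all_score, total_count, clean_score, clean_count):
    """Return (percent of examples that are clean, relative % change of clean vs. all)."""
    clean_pct = 100.0 * clean_count / total_count
    # Assumed definition: change of the clean score relative to the overall score.
    rel_diff = 100.0 * (clean_score - all_score) / all_score
    return clean_pct, rel_diff


# PIQA row from the table: overall 82.3 on 1838 examples, clean 79.3 on 1312 examples.
clean_pct, rel_diff = overlap_summary(82.3, 1838, 79.3, 1312)
print(f"clean: {clean_pct:.0f}%, relative difference: {rel_diff:.0f}%")
# prints "clean: 71%, relative difference: -4%"
```

A negative relative difference means the model scores lower on clean examples than on the full set, which is the pattern the text treats as a possible sign of overfitting to seen examples.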

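This overlap metric tends to show a high rate of false positives for datasets that contain background information (but not answers) drawn from the web (such as SQuAD, which draws from Wikipedia) or examples less than 8 words long, which we ignored in our filtering process (except for word-scrambling tasks). One instance where this technique seems to fail to give good signal is DROP, a reading comprehension task in which 94% of the examples are flagged as dirty. The information required to answer the question is in a passage provided to the model, so having seen the passage during training, but not the questions and answers, does not meaningfully constitute cheating. We confirmed that every matching training document contained only the source passage and none of the questions and answers in the dataset. The more likely explanation for the decrease in performance is that the 6% of examples that remain after filtering come from a slightly different distribution than the dirty examples.

Figure 4.2 shows that as a dataset becomes more contaminated, the variance of the clean/all ratio increases, but there is no apparent bias toward improved or degraded performance. This suggests that GPT-3 is relatively insensitive to contamination. See Section 4.1 for details on the datasets we flagged for further review.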

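Appendix D Total Compute Used to Train Language Models

This appendix contains the calculations used to derive the approximate compute used to train the language models in Figure 2.2. As a simplifying assumption, we ignore the attention operation, as it typically uses less than 10% of the total compute for the models we are analyzing.

The calculations can be seen in Table D.1 and are explained within the table caption.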

Model	Total train compute (PF-days)	Total train compute (flops)	Params (M)	Training tokens (billions)	Flops per param per token	Mult for bwd pass	Fwd-pass flops per active param per token	Frac of params active for each token
T5-Small	2.08E+00	1.80E+20	60	1,000	3	3	1	0.5
T5-Base	7.64E+00	6.60E+20	220	1,000	3	3	1	0.5
T5-Large	2.67E+01	2.31E+21	770	1,000	3	3	1	0.5
T5-3B	1.04E+02	9.00E+21	3,000	1,000	3	3	1	0.5
T5-11B	3.82E+02	3.30E+22	11,000	1,000	3	3	1	0.5
BERT-Base	1.89E+00	1.64E+20	109	250	6	3	2	1.0
BERT-Large	6.16E+00	5.33E+20	355	250	6	3	2	1.0
RoBERTa-Base	1.74E+01	1.50E+21	125	2,000	6	3	2	1.0
RoBERTa-Large	4.93E+01	4.26E+21	355	2,000	6	3	2	1.0
GPT-3 Small	2.60E+00	2.25E+20	125	300	6	3	2	1.0
GPT-3 Medium	7.42E+00	6.41E+20	356	300	6	3	2	1.0
GPT-3 Large	1.58E+01	1.37E+21	760	300	6	3	2	1.0
GPT-3 XL	2.75E+01	2.38E+21	1,320	300	6	3	2	1.0
GPT-3 2.7B	5.52E+01	4.77E+21	2,650	300	6	3	2	1.0
GPT-3 6.7B	1.39E+02	1.20E+22	6,660	300	6	3	2	1.0
GPT-3 13B	2.68E+02	2.31E+22	12,850	300	6	3	2	1.0
GPT-3 175B	3.64E+03	3.14E+23	174,600	300	6	3	2	1.0
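The totals in Table D.1 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it multiplies the parameter count, the number of training tokens, and the "Flops per param per token" column, then converts to petaflop/s-days (one PF-day taken as 10^15 flop/s sustained for 24 hours). The function and constant names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PF_DAY_FLOPS = 1e15 * SECONDS_PER_DAY  # one petaflop/s sustained for one day


def training_compute(params_millions, tokens_billions, flops_per_param_per_token):
    """Return (total training flops, total training compute in PF-days)."""
    total_flops = (
        params_millions * 1e6 * tokens_billions * 1e9 * flops_per_param_per_token
    )
    return total_flops, total_flops / PF_DAY_FLOPS


# GPT-3 175B row: 174,600M parameters, 300B tokens, 6 flops per parameter per token.
flops, pf_days = training_compute(174_600, 300, 6)
print(f"{flops:.2e} flops, {pf_days:.2e} PF-days")
# prints "3.14e+23 flops, 3.64e+03 PF-days"
```

Running the same function on the T5-Small row (60, 1,000, 3) gives 1.80e+20 flops and about 2.08 PF-days, matching the table.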

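Appendix E Human Quality Assessment of Synthetic News Articles

This appendix contains details on the experiments measuring human ability to distinguish GPT-3-generated synthetic news articles from real news articles. We first describe the experiments on the ${\textstyle \sim 200}$ word news articles, and then describe the preliminary investigation of ${\textstyle \sim 500}$ word news articles generated by GPT-3.

Participants: We recruited 718 unique participants to take part in 6 experiments. 97 participants were excluded for failing an internet check question, leaving a total of 621 participants: 343 male, 271 female, and 7 other. Mean participant age was ${\textstyle \sim 38}$ years. All participants were recruited through Positly, which maintains a whitelist of high-performing workers from Mechanical Turk. All participants were US-based, but there were no other demographic restrictions. Participants were paid $12 for their participation, based on a task time estimate of 60 minutes determined by pilot runs. To ensure that the sample of participants for each experiment quiz was unique, participants were not allowed to take part in an experiment more than once.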

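Procedure and design: We arbitrarily selected 25 news articles that appeared on newser.com in early 2020. We used the article titles and subtitles to produce outputs from the 125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13.0B, and 175B (GPT-3) parameter language models. Each model generated five outputs per question, and the generation with a word count closest to that of the human-written article was selected automatically. This was done to minimize the effect that completion length might have on participants' judgments. The same output procedure was used for each model, with the exception of the removal of the intentionally bad control model, as described in the main text.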
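The automatic selection step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes word counts are taken by whitespace splitting and that ties go to the earliest candidate, neither of which is specified in the text. The function name is hypothetical.

```python
def closest_by_word_count(human_article, generations):
    """Pick the generation whose word count is closest to the human-written article."""
    target = len(human_article.split())  # assumption: whitespace-delimited words
    # min() returns the first candidate among ties.
    return min(generations, key=lambda text: abs(len(text.split()) - target))


human = "markets rallied sharply on friday"  # 5 words
candidates = [
    "stocks rose",                                # 2 words
    "stocks rose strongly on friday afternoon",   # 6 words
    "stocks rose strongly on friday afternoon as investors cheered",  # 9 words
]
print(closest_by_word_count(human, candidates))
# prints "stocks rose strongly on friday afternoon"
```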

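In each experiment, half of the participants were randomly assigned to quiz A and half to quiz B. Each quiz consisted of 25 articles: half (12-13) were human written and half (12-13) were model generated. The articles with human-written completions in quiz A had model-generated completions in quiz B, and vice versa. The order of quiz questions was shuffled for each participant. Participants could leave comments and were asked to indicate whether they had seen the articles before. Participants were instructed not to look up the articles or their content during the quiz, and at the end of the quiz they were asked whether they had looked anything up during the quiz.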

Model	Participants Recruited	Participants Excluded	Genders (m:f:other)	Mean Age	Average Word Count (human:model)
Control	76	7	32:37:0	39	216:216
GPT-3 Small	80	7	41:31:1	40	216:188
GPT-3 Medium	80	7	46:28:2	39	216:202
GPT-3 Large	81	24	46:28:2	37	216:200
GPT-3 XL	79	14	32:32:1	38	216:199
GPT-3 2.7B	80	11	36:33:0	40	216:202
GPT-3 6.7B	76	5	46:28:2	37	216:195
GPT-3 13.0B	81	13	46:28:2	37	216:209
GPT-3 175B	80	9	42:29:0	37	216:216

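Statistical tests: To compare means across the different runs, we performed a two-sample t-test for independent groups for each model against the control. This was implemented in Python using the scipy.stats.ttest_ind function. When plotting a regression line in the graph of average participant accuracy vs. model size, we fit a power law of the form ${\textstyle ax^{-b}}$. The 95% confidence intervals were estimated from the t-distribution of the sample mean.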
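For readers who want to see what these two computations involve, here is a dependency-free sketch. It computes the pooled-variance two-sample t statistic (the same statistic scipy.stats.ttest_ind reports under its default equal-variance setting) and fits ${\textstyle ax^{-b}}$ by least squares in log-log space. The log-log fitting method is an assumption, since the text does not say how the power law was fit, and the function names and sample numbers are hypothetical.

```python
import math
from statistics import mean, variance


def two_sample_t(a, b):
    """Student's two-sample t statistic (pooled variance) and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) via least squares on log y = log a - b * log x."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = mean(lx), mean(ly)
    slope = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    return math.exp(my - slope * mx), -slope  # (a, b)


# Made-up per-participant accuracies, for illustration only.
control = [0.88, 0.84, 0.90, 0.86]
model = [0.55, 0.50, 0.52, 0.49]
t_stat, dof = two_sample_t(control, model)
a, b = fit_power_law([1, 4, 16], [2.0, 1.0, 0.5])  # exact power law: a = 2, b = 0.5
```

Turning the t statistic into a p-value requires the t-distribution CDF, which is what scipy provides and the standard library does not.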

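Duration statistics: In the main text, we discussed the finding that the ability of human participants to distinguish model-generated from human-written news articles decreases as our models become larger. We also found that the average time spent on a given set of questions increases with model size, as shown in Figure E.1. Lower accuracy despite increased time investment from participants supports the finding that larger models generate harder-to-distinguish news articles.

Preliminary investigation of ${\textstyle \sim 500}$ word articles: We recruited 160 unique US-based participants through Positly to take part in 2 experiments (details are given in Table E.2). We randomly selected 12 Reuters world news articles from late 2019 and created a context for GPT-3 175B consisting of a single Reuters article not in this set of 12. We then used the article titles and Reuters locations to generate completions from GPT-3 175B and the 160M control model from the previous experiments. These were used to create two 12-question quizzes per model, each consisting of half human-written and half model-generated articles. Comprehension questions were added, and articles were shown to participants in 3 stages at 30-second intervals to encourage closer reading. Participants were paid $12 for this task. Model generation selection methods, exclusion criteria, and statistical tests mirror those of the previous experiments.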

Model	Participants Recruited	Participants Excluded	Genders (m:f:other)	Mean Age	Average Word Count (human:model)
Control	79	17	32:37:0	39	569:464
GPT-3 175B	81	19	32:30:0	40	569:498

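Appendix F Additional Samples from GPT-3

GPT-3 adapts well to many tasks other than the ones explored in the main body of the paper. As an example, in Figure F.1 we show four uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the style of Wallace Stevens. We first experimented with a few prompts, then generated four samples with no additional editing or selection (sampling at temperature ${\textstyle 1}$ using nucleus sampling [40] with ${\textstyle P = 0.9}$ ). Completions were truncated when the model began to write a new title and author heading, or broke into prose commentary.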

`Context` ${\textstyle \rightarrow}$	`The City`
	`BY C. P. CAVAFY`
	`TRANSLATED BY EDMUND KEELEY`
	`[Poem text omitted]`
	`SOME TREES`
	`John Ashbery`
	`[Poem text omitted]`
	`Shadows on the Way`
	`Wallace Stevens`
`-------- Generated Poem 1 --------`	`-------- Generated Poem 3 --------`
`I must have shadows on the way`	`The sun was all we had. Now, in the shade`
`If I am to walk I must have`	`All is changed. The mind must dwell on those`
`Each step taken slowly and alone`	`White fields, that to its eyes were always old;`
`To have it ready made`	`Those ancient gleams, convoluting`
	`The way they lay among the huge roots,`
`And I must think in lines of grey`	`The great dark tomes of reverie,`
`To have dim thoughts to be my guide`	`The plumed labyrinth of the sea.`
`Must look on blue and green`
`And never let my eye forget`	`When our own breath gonne faint and fade`
`That color is my friend`	`Up the long stairway of the dead`
`And purple must surround me too`	`We shall remember well`
	`The blindness of the light, the whiteness`
`The yellow of the sun is no more`	`Of that white land. The footsteps, and`
`Intrusive than the bluish snow`	`The grasses in that light, and in the shade`
`That falls on all of us. I must have`	`The menacing life.`
`Grey thoughts and blue thoughts walk with me`
`If I am to go away at all.`	`We played, too, once, in the warmest rooms.`
	`To one content, there is one content`
`-------- Generated Poem 2 --------`	`For all mankind. The forest glades`
	`Are all the more for being in fable,`
`There is a way in the world.`	`And wood and town in legend shadowed-`
`That way a man may live in a small town`	`Once more to return. The one content`
`His lifetime and see no man come,`	`Has in it every happiness, more brave`
`Never the stir of men about a market,`	`For being purposeless; and so`
`Never the bobbing of covered carts,`	`Those island gardens, deep in the mind,`
`Never the creak passed on the stair`	`Make of remoteness familiarity`
`Closing some other life; an old man`	`And of the unspoken name a friend.`
`Can prosper in that town, can prosper`
`When his store is seen to close, his wife`	`-------- Generated Poem 4 --------`
`Fades in his house, and the world shouts in vain`
`That no one comes to hear him`	`Nobody will come to this place. It is a road that leads nowhere.`
`Where his small bell tolls.`	`The solitude is deep. The mountains are high.`
`In that small town`	`But they are desolate, and they turn the traveler’s face`
`Everything is known.`	`Towards the North. All the sounds of the world are far away.`
`The ways of the young men are known,`	`When the wind rises above the trees,`
`And the ways of the young women`	`The boughs bow to the ground.`
`As they walk to the Spring;`	`Even the birds that inhabit the tangle of weeds`
`The rumors are all known,`	`That is the roadside cover, are silent. One listens,`
`The quarrels, the hurt, the envy`	`But hears no roar of the forest. One is alone.`
`Of the men and the women.`	`One will be taken.`
`Till a day comes`	`One will be taken.`
`That way out of the world`	`There is no utterance, there is no conversation,`
`That leads to a wood.`	`But one is uneasy all the same....`
`Young, the young man`	`There is a thin blue mist,`
`Walks, a lover now,`	`A darkness rising like smoke,`
`Knows the streets as they are and is of them.`	`And within that darkness`
	`A possession of the heart.`
`Comes to the wood.`	`One will be taken.... It was here, and it will be here again-`
	`Here, under this sky empty and full of light.`
`There, in the wood, among trees,`
`He sees shadows on the way, hears voices,`
`hears the wind and the rustling of leaves;`
`Through an open glade`
`He sees a shape and the shape hears:`
`It waits as he waits,`
`Waits as the shadows wait,`
`As the voices wait;`
`Shadows on the way, voices in the wind.`
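The sampling setup mentioned for these poems (nucleus sampling with ${\textstyle P = 0.9}$ at temperature 1) can be sketched on a toy next-token distribution. This is a generic illustration of nucleus (top-p) sampling, not GPT-3's actual decoding code; the token probabilities and function names are hypothetical. Temperature 1 means the model's probabilities are used unscaled before the filter is applied.

```python
import random


def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return {token: p / cumulative for token, p in kept}


def sample_token(probs, top_p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]


# Toy distribution: the low-probability tail ("zebra") is cut off before sampling.
next_token_probs = {"the": 0.5, "a": 0.3, "shadow": 0.15, "zebra": 0.05}
nucleus = nucleus_filter(next_token_probs, top_p=0.9)
token = sample_token(next_token_probs, top_p=0.9)
```

With these numbers the nucleus contains "the", "a", and "shadow" (cumulative mass 0.95), so "zebra" can never be drawn.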

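Appendix G Details of Task Phrasing and Specifications

The following figures illustrate the formatting and phrasing of all the tasks included in the paper. All data comes from the ground-truth datasets in this section, and no samples from GPT-3 are included here.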

`Context` ${\textstyle \rightarrow}$	`Article:`
	Informal conversation is an important part of any business relationship.Before you start a discussion,however,make sure you understand which topics are suitable and which are considered taboo in a particular culture. Latin Americans enjoy sharing information about their local history, art and customs.You may expect questions about your family,and be sure to show pictures of your children.You may feel free to ask similar questions of your Latin American friends.The French think of conversation as an art form,and they enjoy the value of lively discussions as well as disagreements. For them,arguments can be interesting and they can cover pretty much or any topic ---- as long as they occur in are respectful and intelligent manner.
	In the United States,business people like to discuss a wide range of topics,including opinions about work,family,hobbies,and politics. In Japan,China,and Korea,however,people are much more private.They do not share much about their thoughts,feelings,or emotions because they feel that doing so might take away from the harmonious business relationship they’re trying to build.Middle Easterners are also private about their personal lives and family matters.It is considered rude,for example,to ask a businessman from Saudi Arabia about his wife or children.
	`As a general rule,it’s best not to talk about politics or religion with your business friends.This can get you into trouble,even in the United States,where people hold different religious views.In addition,discussing one’s salary is usually considered unsuitable.Sports is typically a friendly subject in most parts of the world,although be careful not to criticize national sport.Instead,be friendly and praise your host’s team.`
	`Q: What shouldn’t you do when talking about sports with colleagues from another country?`
	`A: Criticizing the sports of your colleagues’ country.`
	`Q: Which is typically a friendly topic in most places according to the author?`
	`A: Sports.`
	`Q: Why are people from Asia more private in their conversation with others?`
	`A: They don’t want to have their good relationship with others harmed by informal conversation.`
	`Q: The author considers politics and religion _ .`
	`A:`
`Correct Answer` ${\textstyle \rightarrow}$	`taboo`
`Incorrect Answer` ${\textstyle \rightarrow}$	`cheerful topics`
`Incorrect Answer` ${\textstyle \rightarrow}$	`rude topics`
`Incorrect Answer` ${\textstyle \rightarrow}$	`topics that can never be talked about`

`Context` ${\textstyle \rightarrow}$	`anli 2: anli 2: The Gold Coast Hotel & Casino is a hotel and casino located in Paradise, Nevada. This locals’ casino is owned and operated by Boyd Gaming. The Gold Coast is located one mile (` ${\textstyle \sim {1.6\hspace{0pt}{km}}}$ `) west of the Las Vegas Strip on West Flamingo Road. It is located across the street from the Palms Casino Resort and the Rio All Suite Hotel and Casino.`
	`Question: The Gold Coast is a budget-friendly casino. True, False, or Neither?`
`Correct Answer` ${\textstyle \rightarrow}$	`Neither`
`Incorrect Answer` ${\textstyle \rightarrow}$	`True`
`Incorrect Answer` ${\textstyle \rightarrow}$	`False`

`Context` ${\textstyle \rightarrow}$	`Article:`
	`Mrs. Smith is an unusual teacher. Once she told each student to bring along a few potatoes in plastic bag. On each potato the students had to write a name of a person that they hated And the next day, every child brought some potatoes. Some had two potatoes;some three;some up to five.`
	`Mrs. Smith then told the children to carry the bags everywhere they went, even to the toilet, for two weeks. As day after day passed, the children started to complain about the awful smell of the rotten potatoes.`
	`Those children who brought five potatoes began to feel the weight trouble of the bags. After two weeks, the children were happy to hear that the game was finally ended. Mrs. Smith asked,"How did you feel while carrying the potatoes for two weeks?" The children started complaining about the trouble loudly.`
	Then Mrs. Smith told them why she asked them to play the game. She said,"This is exactly the situation when you carry your hatred for somebody inside your heart. The terrible smell of the hatred will pollute your heart and you will carry something unnecessary with you all the time. If you cannot stand the smell of the rotten potatoes for just two weeks, can you imagine how heavy it would be to have the hatred in your heart for your lifetime? So throw away any hatred from your heart, and you’ll be really happy."
	`Q: Which of the following is True according to the passage?`
	`A: If a kid hated four people,he or she had to carry four potatoes.`
	`Q: We can learn from the passage that we should _ .`
	`A: throw away the hatred inside`
	`Q: The children complained about _ besides the weight trouble.`
	`A: the smell`
	`Q: Mrs.Smith asked her students to write _ on the potatoes.`
	`A:`
`Correct Answer` ${\textstyle \rightarrow}$	`names`
`Incorrect Answer` ${\textstyle \rightarrow}$	`numbers`
`Incorrect Answer` ${\textstyle \rightarrow}$	`time`
`Incorrect Answer` ${\textstyle \rightarrow}$	`places`

`Context` ${\textstyle \rightarrow}$	`How to apply sealant to wood.`
`Correct Answer` ${\textstyle \rightarrow}$	`Using a brush, brush on sealant onto wood until it is fully saturated with the sealant.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`Using a brush, drip on sealant onto wood until it is fully saturated with the sealant.`

`Context` ${\textstyle \rightarrow}$	`My body cast a shadow over the grass because`
`Correct Answer` ${\textstyle \rightarrow}$	`the sun was rising.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`the grass was cut.`

`Context` ${\textstyle \rightarrow}$	(CNN) Yuval Rabin, whose father, Yitzhak Rabin, was assassinated while serving as Prime Minister of Israel, criticized Donald Trump for appealing to "Second Amendment people" in a speech and warned that the words that politicians use can incite violence and undermine democracy. "Trump’s words are an incitement to the type of political violence that touched me personally," Rabin wrote in USAToday. He said that Trump’s appeal to "Second Amendment people" to stop Hillary Clinton -- comments that were criticized as a call for violence against Clinton, something Trump denied -- "were a new level of ugliness in an ugly campaign season."
	`- The son of a former Israeli Prime Minister who was assassinated wrote an op ed about the consequence of violent political rhetoric.`
	`- Warns of "parallels" between Israel of the 1990s and the U.S. today.`
`Correct Answer` ${\textstyle \rightarrow}$	`- Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned Donald Trump’s aggressive rhetoric.`
`Correct Answer` ${\textstyle \rightarrow}$	`- Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned Trump’s aggressive rhetoric.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`- Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned Hillary Clinton’s aggressive rhetoric.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`- Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned U.S.’s aggressive rhetoric.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`- Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned Yitzhak Rabin’s aggressive rhetoric.`

`Context` ${\textstyle \rightarrow}$	`anli 1: anli 1: Fulton James MacGregor MSP is a Scottish politician who is a Scottish National Party (SNP) Member of Scottish Parliament for the constituency of Coatbridge and Chryston. MacGregor is currently Parliamentary Liaison Officer to Shona Robison, Cabinet Secretary for Health & Sport. He also serves on the Justice and Education & Skills committees in the Scottish Parliament.`
	`Question: Fulton James MacGregor is a Scottish politican who is a Liaison officer to Shona Robison who he swears is his best friend. True, False, or Neither?`
`Correct Answer` ${\textstyle \rightarrow}$	`Neither`
`Incorrect Answer` ${\textstyle \rightarrow}$	`True`
`Incorrect Answer` ${\textstyle \rightarrow}$	`False`

`Context` ${\textstyle \rightarrow}$	`Organisms require energy in order to do what?`
`Correct Answer` ${\textstyle \rightarrow}$	`mature and develop.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`rest soundly.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`absorb light.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`take in nutrients.`

`Context` ${\textstyle \rightarrow}$	`Making a cake: Several cake pops are shown on a display. A woman and girl are shown making the cake pops in a kitchen. They`
`Correct Answer` ${\textstyle \rightarrow}$	`bake them, then frost and decorate.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`taste them as they place them on plates.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`put the frosting on the cake as they pan it.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`come out and begin decorating the cake as well.`

`Context` ${\textstyle \rightarrow}$	`anli 3: anli 3: We shut the loophole which has American workers actually subsidizing the loss of their own job. They just passed an expansion of that loophole in the last few days: $43 billion of giveaways, including favors to the oil and gas industry and the people importing ceiling fans from China.`
	`Question: The loophole is now gone True, False, or Neither?`
`Correct Answer` ${\textstyle \rightarrow}$	`False`
`Incorrect Answer` ${\textstyle \rightarrow}$	`True`
`Incorrect Answer` ${\textstyle \rightarrow}$	`Neither`

`Context` ${\textstyle \rightarrow}$	`Question: George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?`
	`Answer:`
`Correct Answer` ${\textstyle \rightarrow}$	`dry palms`
`Incorrect Answer` ${\textstyle \rightarrow}$	`wet palms`
`Incorrect Answer` ${\textstyle \rightarrow}$	`palms covered with oil`
`Incorrect Answer` ${\textstyle \rightarrow}$	`palms covered with lotion`

`Context` ${\textstyle \rightarrow}$	`lull is to trust as`
`Correct Answer` ${\textstyle \rightarrow}$	`cajole is to compliance`
`Incorrect Answer` ${\textstyle \rightarrow}$	`balk is to fortitude`
`Incorrect Answer` ${\textstyle \rightarrow}$	`betray is to loyalty`
`Incorrect Answer` ${\textstyle \rightarrow}$	`hinder is to destination`
`Incorrect Answer` ${\textstyle \rightarrow}$	`soothe is to passion`

`Correct Context` ${\textstyle \rightarrow}$	`Grace was happy to trade me her sweater for my jacket. She thinks the sweater`
`Incorrect Context` ${\textstyle \rightarrow}$	`Grace was happy to trade me her sweater for my jacket. She thinks the jacket`
`Target Completion` ${\textstyle \rightarrow}$	`looks dowdy on her.`

`Correct Context` ${\textstyle \rightarrow}$	`Johnny likes fruits more than vegetables in his new keto diet because the fruits`
`Incorrect Context` ${\textstyle \rightarrow}$	`Johnny likes fruits more than vegetables in his new keto diet because the vegetables`
`Target Completion` ${\textstyle \rightarrow}$	`are saccharine.`

`Context` ${\textstyle \rightarrow}$	`READING COMPREHENSION ANSWER KEY`
	While this process moved along, diplomacy continued its rounds. Direct pressure on the Taliban had proved unsuccessful. As one NSC staff note put it, "Under the Taliban, Afghanistan is not so much a state sponsor of terrorism as it is a state sponsored by terrorists." In early 2000, the United States began a high-level effort to persuade Pakistan to use its influence over the Taliban. In January 2000, Assistant Secretary of State Karl Inderfurth and the State Department’s counterterrorism coordinator, Michael Sheehan, met with General Musharraf in Islamabad, dangling before him the possibility of a presidential visit in March as a reward for Pakistani cooperation. Such a visit was coveted by Musharraf, partly as a sign of his government’s legitimacy. He told the two envoys that he would meet with Mullah Omar and press him on Bin Laden. They left, however, reporting to Washington that Pakistan was unlikely in fact to do anything," given what it sees as the benefits of Taliban control of Afghanistan." President Clinton was scheduled to travel to India. The State Department felt that he should not visit India without also visiting Pakistan. The Secret Service and the CIA, however, warned in the strongest terms that visiting Pakistan would risk the President’s life. Counterterrorism officials also argued that Pakistan had not done enough to merit a presidential visit. But President Clinton insisted on including Pakistan in the itinerary for his trip to South Asia. His one-day stopover on March 25, 2000, was the first time a U.S. president had been there since 1969. At his meeting with Musharraf and others, President Clinton concentrated on tensions between Pakistan and India and the dangers of nuclear proliferation, but also discussed Bin Laden. President Clinton told us that when he pulled Musharraf aside for a brief, one-on-one meeting, he pleaded with the general for help regarding Bin Laden." 
I offered him the moon when I went to see him, in terms of better relations with the United States, if he’d help us get Bin Laden and deal with another issue or two." The U.S. effort continued.
	`Who did The State Department feel should visit both India and Pakistan?`
`Correct Answer` ${\textstyle \rightarrow}$	`- [False] Bin Laden`
`Incorrect Answer` ${\textstyle \rightarrow}$	`- [True] Bin Laden`

`Context` ${\textstyle \rightarrow}$	`Question: Which factor will most likely cause a person to develop a fever?`
	`Answer:`
`Correct Answer` ${\textstyle \rightarrow}$	`a bacterial population in the bloodstream`
`Incorrect Answer` ${\textstyle \rightarrow}$	`a leg muscle relaxing after exercise`
`Incorrect Answer` ${\textstyle \rightarrow}$	`several viral particles on the skin`
`Incorrect Answer` ${\textstyle \rightarrow}$	`carbohydrates being digested in the stomach`

`Context` ${\textstyle \rightarrow}$	`Bob went to the gas station to fill up his car. His tank was completely empty and so was his wallet. The cashier offered to pay for his gas if he came back later to pay. Bob felt grateful as he drove home.`
`Correct Answer` ${\textstyle \rightarrow}$	`Bob believed that there were good people in the world.`
`Incorrect Answer` ${\textstyle \rightarrow}$	`Bob contemplated how unfriendly the world was.`

`Context` ${\textstyle \rightarrow}$	`Helsinki is the capital and largest city of Finland. It is in the region of Uusimaa, in southern Finland, on the shore of the Gulf of Finland. Helsinki has a population of , an urban population of , and a metropolitan population of over 1.4 million, making it the most populous municipality and urban area in Finland. Helsinki is some north of Tallinn, Estonia, east of Stockholm, Sweden, and west of Saint Petersburg, Russia. Helsinki has close historical connections with these three cities.`
	The Helsinki metropolitan area includes the urban core of Helsinki, Espoo, Vantaa, Kauniainen, and surrounding commuter towns. It is the world’s northernmost metro area of over one million people, and the city is the northernmost capital of an EU member state. The Helsinki metropolitan area is the third largest metropolitan area in the Nordic countries after Stockholm and Copenhagen, and the City of Helsinki is the third largest after Stockholm and Oslo. Helsinki is Finland’s major political, educational, financial, cultural, and research center as well as one of northern Europe’s major cities. Approximately 75% of foreign companies that operate in Finland have settled in the Helsinki region. The nearby municipality of Vantaa is the location of Helsinki Airport, with frequent service to various destinations in Europe and Asia.
	`Q: what is the most populous municipality in Finland?`
	`A: Helsinki`
	`Q: how many people live there?`
	`A: 1.4 million in the metropolitan area`
	`Q: what percent of the foreign companies that operate in Finland are in Helsinki?`
	`A: 75%`
	`Q: what towns are a part of the metropolitan area?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`Helsinki, Espoo, Vantaa, Kauniainen, and surrounding commuter towns`

`Context` ${\textstyle \rightarrow}$	`Please unscramble the letters into a word, and write that word:`
	`asinoc =`
`Target Completion` ${\textstyle \rightarrow}$	`casino`
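To make the format above concrete, the sketch below assembles a context string for the word-unscrambling task. The instruction line and the `word =` pattern are taken from the figure. Placing the answer after a single space and separating demonstrations with a blank line are assumptions for illustration, since the figures in this appendix show only a single formatted example. The function names are hypothetical.

```python
INSTRUCTION = "Please unscramble the letters into a word, and write that word:"


def format_example(scrambled, answer=None):
    """Format one example; leave the answer slot open when answer is None."""
    prompt = f"{INSTRUCTION}\n{scrambled} ="
    # Assumption: the target follows the "=" after a single space.
    return prompt if answer is None else f"{prompt} {answer}"


def build_context(demonstrations, query):
    """Join solved demonstrations and one open query (blank-line separator is assumed)."""
    blocks = [format_example(s, a) for s, a in demonstrations]
    blocks.append(format_example(query))
    return "\n\n".join(blocks)


context = build_context([("skicts", "sticks"), ("taefed", "defeat")], "asinoc")
print(context)
```

The model's completion of the final open line would then be compared against the target ("casino" in the figure above).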

`Context` ${\textstyle \rightarrow}$	Passage: Saint Jean de Brébeuf was a French Jesuit missionary who travelled to New France in 1625. There he worked primarily with the Huron for the rest of his life, except for a few years in France from 1629 to 1633. He learned their language and culture, writing extensively about each to aid other missionaries. In 1649, Brébeuf and another missionary were captured when an Iroquois raid took over a Huron village . Together with Huron captives, the missionaries were ritually tortured and killed on March 16, 1649. Brébeuf was beatified in 1925 and among eight Jesuit missionaries canonized as saints in the Roman Catholic Church in 1930.
	`Question: How many years did Saint Jean de Brébeuf stay in New France before he went back to France for a few years?`
	`Answer:`
`Target Completion` ${\textstyle \rightarrow}$	`4`

`Context` ${\textstyle \rightarrow}$	`Fill in blank:`
	`She held the torch in front of her.`
	`She caught her breath.`
	`"Chris? There’s a step."`
	`"What?"`
	`"A step. Cut in the rock. About fifty feet ahead." She moved faster. They both moved faster. "In fact," she said, raising the torch higher, "there’s more than a ____. ->`
`Target Completion` ${\textstyle \rightarrow}$	`step`

`Context` ${\textstyle \rightarrow}$	`Please unscramble the letters into a word, and write that word:`
	`skicts =`
`Target Completion` ${\textstyle \rightarrow}$	`sticks`

`Context` ${\textstyle \rightarrow}$	`Please unscramble the letters into a word, and write that word:`
	`volwskagen =`
`Target Completion` ${\textstyle \rightarrow}$	`volkswagen`

`Context` ${\textstyle \rightarrow}$	`Q: Who played tess on touched by an angel?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`Delloreese Patricia Early (July 6, 1931 – November 19, 2017), known professionally as Della Reese`

`Context` ${\textstyle \rightarrow}$	`TITLE: William Perry (American football) - Professional career`
	PARAGRAPH: In 1985, he was selected in the first round of the 1985 NFL Draft by the Chicago Bears; he had been hand-picked by coach Mike Ditka. However, defensive coordinator Buddy Ryan, who had a highly acrimonious relationship with Ditka, called Perry a "wasted draft-pick". Perry soon became a pawn in the political power struggle between Ditka and Ryan. Perry’s "Refrigerator" nickname followed him into the NFL and he quickly became a favorite of the Chicago Bears fans. Teammates called him "Biscuit," as in "one biscuit shy of 350 pounds." While Ryan refused to play Perry, Ditka decided to use Perry as a fullback when the team was near the opponents’ goal line or in fourth and short situations, either as a ball carrier or a lead blocker for star running back Walter Payton. Ditka stated the inspiration for using Perry as a fullback came to him during five-yard sprint exercises. During his rookie season, Perry rushed for two touchdowns and caught a pass for one. Perry even had the opportunity to run the ball during Super Bowl XX, as a nod to his popularity and contributions to the team’s success. The first time he got the ball, he was tackled for a one-yard loss while attempting to throw his first NFL pass on a halfback option play. The second time he got the ball, he scored a touchdown (running over Patriots linebacker Larry McGrew in the process). About halfway through his rookie season, Ryan finally began to play Perry, who soon proved that he was a capable defensive lineman. His Super Bowl ring size is the largest of any professional football player in the history of the event. His ring size is 25, while the ring size for the average adult male is between 10 and 12. Perry went on to play for ten years in the NFL, retiring after the 1994 season. In his ten years as a pro, he regularly struggled with his weight, which hampered his performance at times. 
He played in 138 games, recording 29.5 sacks and five fumble recoveries, which he returned for a total of 71 yards. In his offensive career he ran five yards for two touchdowns, and had one reception for another touchdown. Perry later attempted a comeback, playing an unremarkable 1996 season with the London Monarchs of the World League of American Football (later NFL Europa).
	`Q: what team did he play for?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`the Chicago Bears`

`Context` ${\textstyle \rightarrow}$	`Please unscramble the letters into a word, and write that word:`
	`r e!c.i p r o.c a/l =`
`Target Completion` ${\textstyle \rightarrow}$	`reciprocal`

`Context` ${\textstyle \rightarrow}$	`Please unscramble the letters into a word, and write that word:`
	`taefed =`
`Target Completion` ${\textstyle \rightarrow}$	`defeat`

`Context` ${\textstyle \rightarrow}$	`Title: The_Blitz`
	Background: From the German point of view, March 1941 saw an improvement. The Luftwaffe flew 4,000 sorties that month, including 12 major and three heavy attacks. The electronic war intensified but the Luftwaffe flew major inland missions only on moonlit nights. Ports were easier to find and made better targets. To confuse the British, radio silence was observed until the bombs fell. X- and Y-Gerät beams were placed over false targets and switched only at the last minute. Rapid frequency changes were introduced for X-Gerät, whose wider band of frequencies and greater tactical flexibility ensured it remained effective at a time when British selective jamming was degrading the effectiveness of Y-Gerät.
	`Q: How many sorties were flown in March 1941?`
	`A: 4,000`
	`Q: When did the Luftwaffe fly inland missions?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`only on moonlit nights`

`Context` ${\textstyle \rightarrow}$	Normal force -- In a simple case such as an object resting upon a table, the normal force on the object is equal but in opposite direction to the gravitational force applied on the object (or the weight of the object), that is, N = m g (\displaystyle N=mg), where m is mass, and g is the gravitational field strength (about 9.81 m/s on Earth). The normal force here represents the force applied by the table against the object that prevents it from sinking through the table and requires that the table is sturdy enough to deliver this normal force without breaking. However, it is easy to assume that the normal force and weight are action-reaction force pairs (a common mistake). In this case, the normal force and weight need to be equal in magnitude to explain why there is no upward acceleration of the object. For example, a ball that bounces upwards accelerates upwards because the normal force acting on the ball is larger in magnitude than the weight of the ball.
	`question: is the normal force equal to the force of gravity?`
	`answer:`
`Target Completion` ${\textstyle \rightarrow}$	`yes`

`Context` ${\textstyle \rightarrow}$	`The trend toward lower rents may seem surprising given that some communities in New York are bemoaning the loss of favorite local businesses to high rents. But, despite the recent softening, for many of these retailers there’s still been too big a jump from the rental rates of the late 1970s, when their leases were signed. Certainly, the recent drop in prices doesn’t mean Manhattan comes cheap.`
	`question: Manhattan comes cheap. true, false, or neither?`
	`answer:`
`Target Completion` ${\textstyle \rightarrow}$	`false`

`Context` ${\textstyle \rightarrow}$	`The bet, which won him dinner for four, was regarding the existence and mass of the top quark, an elementary particle discovered in 1995.`
	`question: The Top Quark is the last of six flavors of quarks predicted by the standard model theory of particle physics. True or False?`
	`answer:`
`Target Completion` ${\textstyle \rightarrow}$	`False`

`Context` ${\textstyle \rightarrow}$	`An outfitter provided everything needed for the safari.`
	`Before his first walking holiday, he went to a specialist outfitter to buy some boots.`
	`question: Is the word ‘outfitter’ used in the same way in the two sentences above?`
	`answer:`
`Target Completion` ${\textstyle \rightarrow}$	`no`

`Context` ${\textstyle \rightarrow}$	`Final Exam with Answer Key`
	`Instructions: Please carefully read the following passages. For each passage, you must identify which noun the pronoun marked in bold refers to.`
	`=====`
	`Passage: Mr. Moncrieff visited Chester’s luxurious New York apartment, thinking that it belonged to his son Edward. The result was that Mr. Moncrieff has decided to cancel Edward’s allowance on the ground that he no longer requires his financial support.`
	`Question: In the passage above, what does the pronoun "his" refer to?`
	`Answer:`
`Target Completion` ${\textstyle \rightarrow}$	`mr. moncrieff`

`Context` ${\textstyle \rightarrow}$	`Q: ‘Nude Descending A Staircase’ is perhaps the most famous painting by which 20th century artist?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`MARCEL DUCHAMP`
`Target Completion` ${\textstyle \rightarrow}$	`r mutt`
`Target Completion` ${\textstyle \rightarrow}$	`duchamp`
`Target Completion` ${\textstyle \rightarrow}$	`marcel duchamp`
`Target Completion` ${\textstyle \rightarrow}$	`R.Mutt`
`Target Completion` ${\textstyle \rightarrow}$	`Marcel duChamp`
`Target Completion` ${\textstyle \rightarrow}$	`Henri-Robert-Marcel Duchamp`
`Target Completion` ${\textstyle \rightarrow}$	`Marcel du Champ`
`Target Completion` ${\textstyle \rightarrow}$	`henri robert marcel duchamp`
`Target Completion` ${\textstyle \rightarrow}$	`Duchampian`
`Target Completion` ${\textstyle \rightarrow}$	`Duchamp`
`Target Completion` ${\textstyle \rightarrow}$	`duchampian`
`Target Completion` ${\textstyle \rightarrow}$	`marcel du champ`
`Target Completion` ${\textstyle \rightarrow}$	`Marcel Duchamp`
`Target Completion` ${\textstyle \rightarrow}$	`MARCEL DUCHAMP`
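The example above lists several acceptable target completions for a single question. A minimal sketch of how a free-form answer could be checked against such an alias list follows. The normalization used here (lowercasing, stripping punctuation, collapsing whitespace) and the function names are assumptions for illustration, not the evaluation procedure described in the paper.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def matches_any_target(prediction: str, targets: list) -> bool:
    """Return True if the normalized prediction equals any normalized target."""
    normalized_targets = {normalize(t) for t in targets}
    return normalize(prediction) in normalized_targets


# A subset of the target completions listed in the example above.
TARGETS = [
    "MARCEL DUCHAMP",
    "r mutt",
    "duchamp",
    "R.Mutt",
    "Henri-Robert-Marcel Duchamp",
    "Marcel du Champ",
    "Duchampian",
]

print(matches_any_target("Marcel Duchamp", TARGETS))
print(matches_any_target("Pablo Picasso", TARGETS))
```

Under this assumed normalization, differences in case and trailing punctuation do not affect the match, while an answer outside the alias list is rejected.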

`Context` ${\textstyle \rightarrow}$	`Q: What school did burne hogarth establish?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`School of Visual Arts`

`Context` ${\textstyle \rightarrow}$	`Keinesfalls dürfen diese für den kommerziellen Gebrauch verwendet werden. =`
`Target Completion` ${\textstyle \rightarrow}$	`In no case may they be used for commercial purposes.`

`Context` ${\textstyle \rightarrow}$	`In no case may they be used for commercial purposes. =`
`Target Completion` ${\textstyle \rightarrow}$	`Keinesfalls dürfen diese für den kommerziellen Gebrauch verwendet werden.`

`Context` ${\textstyle \rightarrow}$	`Analysis of instar distributions of larval I. verticalis collected from a series of ponds also indicated that males were in more advanced instars than females. =`
`Target Completion` ${\textstyle \rightarrow}$	`L’analyse de la distribution de fréquence des stades larvaires d’I. verticalis dans une série d’étangs a également démontré que les larves mâles étaient à des stades plus avancés que les larves femelles.`

`Context` ${\textstyle \rightarrow}$	`L’analyse de la distribution de fréquence des stades larvaires d’I. verticalis dans une série d’étangs a également démontré que les larves mâles étaient à des stades plus avancés que les larves femelles. =`
`Target Completion` ${\textstyle \rightarrow}$	`Analysis of instar distributions of larval I. verticalis collected from a series of ponds also indicated that males were in more advanced instars than females.`

`Context` ${\textstyle \rightarrow}$	`The truth is that you want, at any price, and against the wishes of the peoples of Europe, to continue the negotiations for Turkey’s accession to the European Union, despite Turkey’s continuing refusal to recognise Cyprus and despite the fact that the democratic reforms are at a standstill. =`
`Target Completion` ${\textstyle \rightarrow}$	`Adevărul este că vă doriţi, cu orice preţ şi împotriva dorinţei europenilor, să continuaţi negocierile de aderare a Turciei la Uniunea Europeană, în ciuda refuzului continuu al Turciei de a recunoaşte Ciprul şi în ciuda faptului că reformele democratice au ajuns într-un punct mort.`

`Context` ${\textstyle \rightarrow}$	`Adevărul este că vă doriţi, cu orice preţ şi împotriva dorinţei europenilor, să continuaţi negocierile de aderare a Turciei la Uniunea Europeană, în ciuda refuzului continuu al Turciei de a recunoaşte Ciprul şi în ciuda faptului că reformele democratice au ajuns într-un punct mort. =`
`Target Completion` ${\textstyle \rightarrow}$	`The truth is that you want, at any price, and against the wishes of the peoples of Europe, to continue the negotiations for Turkey’s accession to the European Union, despite Turkey’s continuing refusal to recognise Cyprus and despite the fact that the democratic reforms are at a standstill.`
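The translation examples above share one layout: the source sentence followed by ` =`, with the translation as the target completion. The sketch below assembles prompts in that layout. The helper names and the blank-line separator between demonstrations are assumptions for illustration; the examples shown here do not specify how demonstrations are delimited.

```python
from typing import Optional, Sequence, Tuple


def format_example(source: str, target: Optional[str] = None) -> str:
    """Render one example as 'source =' or, for a demonstration, 'source = target'."""
    prompt = f"{source} ="
    return prompt if target is None else f"{prompt} {target}"


def build_prompt(
    demos: Sequence[Tuple[str, str]],
    query: str,
    separator: str = "\n\n",  # assumed delimiter between demonstrations
) -> str:
    """Concatenate solved demonstrations, then the unsolved query."""
    parts = [format_example(src, tgt) for src, tgt in demos]
    parts.append(format_example(query))
    return separator.join(parts)


GERMAN = "Keinesfalls dürfen diese für den kommerziellen Gebrauch verwendet werden."
ENGLISH = "In no case may they be used for commercial purposes."

# With no demonstrations this reproduces the De -> En context shown above.
print(build_prompt([], GERMAN))
# One demonstration followed by the same sentence as the query (illustrative only).
print(build_prompt([(GERMAN, ENGLISH)], GERMAN))
```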

`Context` ${\textstyle \rightarrow}$	`Q: What is (2 * 4) * 6?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`48`

`Context` ${\textstyle \rightarrow}$	`Q: What is 17 minus 14?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`3`

`Context` ${\textstyle \rightarrow}$	`Q: What is 98 plus 45?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`143`

`Context` ${\textstyle \rightarrow}$	`Q: What is 95 times 45?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`4275`

`Context` ${\textstyle \rightarrow}$	`Q: What is 509 minus 488?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`21`

`Context` ${\textstyle \rightarrow}$	`Q: What is 556 plus 497?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`1053`

`Context` ${\textstyle \rightarrow}$	`Q: What is 6209 minus 3365?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`2844`

`Context` ${\textstyle \rightarrow}$	`Q: What is 9923 plus 617?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`10540`

`Context` ${\textstyle \rightarrow}$	`Q: What is 40649 minus 78746?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`-38097`

`Context` ${\textstyle \rightarrow}$	`Q: What is 65360 plus 16204?`
	`A:`
`Target Completion` ${\textstyle \rightarrow}$	`81564`
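The arithmetic examples above all follow the pattern `Q: What is <a> <operation> <b>?` with `A:` on the next line. As a rough sketch, the prompts and their target completions can be regenerated and cross-checked as follows. The function names are hypothetical, and only the two-operand examples are covered.

```python
import operator

# Operation words used in the two-operand prompts above.
OPERATIONS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}


def arithmetic_prompt(a: int, op_word: str, b: int) -> str:
    """Build the context string in the format shown above."""
    return f"Q: What is {a} {op_word} {b}?\nA:"


def arithmetic_target(a: int, op_word: str, b: int) -> str:
    """Compute the expected target completion as a string."""
    return str(OPERATIONS[op_word](a, b))


# (a, operation, b, target completion) taken from the examples above.
EXAMPLES = [
    (17, "minus", 14, "3"),
    (98, "plus", 45, "143"),
    (95, "times", 45, "4275"),
    (509, "minus", 488, "21"),
    (556, "plus", 497, "1053"),
    (6209, "minus", 3365, "2844"),
    (9923, "plus", 617, "10540"),
    (40649, "minus", 78746, "-38097"),
    (65360, "plus", 16204, "81564"),
]

for a, op_word, b, expected in EXAMPLES:
    assert arithmetic_target(a, op_word, b) == expected

print(arithmetic_prompt(98, "plus", 45))
```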

Appendix H Results on All Tasks for All Model Sizes

					Zero-Shot								One-Shot								Few-Shot
Name	Metric	Split	Fine-tune SOTA	K	Small	Med	Large	XL	2.7B	6.7B	13B	175B	Small	Med	Large	XL	2.7B	6.7B	13B	175B	Small	Med	Large	XL	2.7B	6.7B	13B	175B	175B (test server)
HellaSwag	acc	dev	85.6	20	33.7	43.6	51.0	54.7	62.8	67.4	70.9	78.9	33.0	42.9	50.5	53.5	61.9	66.5	70.0	78.1	33.5	43.1	51.3	54.9	62.9	67.3	71.3	79.3
LAMBADA	acc	test	68.0	15	42.7	54.3	60.4	63.6	67.1	70.3	72.5	76.2	22.0	47.1	52.6	58.3	61.1	65.4	69.0	72.5	22.0	40.4	63.2	57.0	78.1	79.1	81.3	86.4
LAMBADA	ppl	test	8.63	15	18.6	9.09	6.53	5.44	4.60	4.00	3.56	3.00	165.0	11.6	8.29	6.46	5.53	4.61	4.06	3.35	165.0	27.6	6.63	7.45	2.89	2.56	2.56	1.92
StoryCloze	acc	test	91.8	70	63.3	68.5	72.4	73.4	77.2	77.7	79.5	83.2	62.3	68.7	72.3	74.2	77.3	78.7	79.7	84.7	62.3	70.2	73.9	76.1	80.2	81.2	83.0	87.7
NQs	acc	test	44.5	64	0.64	1.75	2.71	4.40	6.01	5.79	7.84	14.6	1.19	3.07	4.79	5.43	8.73	9.78	13.7	23.0	1.72	4.46	7.89	9.72	13.2	17.0	21.0	29.9
TriviaQA	acc	dev	68.0	64	4.15	7.61	14.0	19.7	31.3	38.7	41.8	64.3	4.19	12.9	20.5	26.5	35.9	44.4	51.3	68.0	6.96	16.3	26.5	32.1	42.3	51.6	57.5	71.2	71.2
WebQs	acc	test	45.5	64	1.77	3.20	4.33	4.63	7.92	7.73	8.22	14.4	2.56	6.20	8.51	9.15	14.5	15.1	19.0	25.3	5.46	12.6	15.9	19.6	24.8	27.7	33.5	41.5
Ro ${\textstyle \rightarrow}$ En 16	BLEU-mb	test	39.9	64	2.08	2.71	3.09	3.15	16.3	8.34	20.2	19.9	0.55	15.4	23.0	26.3	30.6	33.2	35.6	38.6	1.25	20.7	25.8	29.2	33.1	34.8	37.0	39.5
Ro ${\textstyle \rightarrow}$ En 16	BLEU-sb	test		64	2.39	3.08	3.49	3.56	16.8	8.75	20.8	20.9	0.65	15.9	23.6	26.8	31.3	34.2	36.7	40.0	1.40	21.3	26.6	30.1	34.3	36.2	38.4	41.3
En ${\textstyle \rightarrow}$ Ro 16	BLEU-mb	test	38.5	64	2.14	2.65	2.53	2.50	3.46	4.24	5.32	14.1	0.35	3.30	7.89	8.72	13.2	15.1	17.3	20.6	1.25	5.90	9.33	10.7	14.3	16.3	18.0	21.0
En ${\textstyle \rightarrow}$ Ro 16	BLEU-sb	test		64	2.61	3.11	3.07	3.09	4.26	5.31	6.43	18.0	0.55	3.90	9.15	10.3	15.7	18.2	20.8	24.9	1.64	7.40	10.9	12.9	17.2	19.6	21.8	25.8
Fr ${\textstyle \rightarrow}$ En 14	BLEU-mb	test	35.0	64	1.81	2.53	3.47	3.13	20.6	15.1	21.8	21.2	1.28	15.9	23.7	26.3	29.0	30.5	30.2	33.7	4.98	25.5	28.5	31.1	33.7	34.9	36.6	39.2
Fr ${\textstyle \rightarrow}$ En 14	BLEU-sb	test		64	2.29	2.99	3.90	3.60	21.2	15.5	22.4	21.9	1.50	16.3	24.4	27.0	30.0	31.6	31.4	35.6	5.30	26.2	29.5	32.2	35.1	36.4	38.3	41.4
En ${\textstyle \rightarrow}$ Fr 14	BLEU-mb	test	45.6	64	1.74	2.16	2.73	2.15	15.1	8.82	12.0	25.2	0.49	8.00	14.8	15.9	20.3	23.3	24.9	28.3	4.08	14.5	19.3	21.5	24.9	27.3	29.5	32.6
En ${\textstyle \rightarrow}$ Fr 14	BLEU-sb	test	45.9	64	2.44	2.75	3.54	2.82	19.3	11.4	15.3	31.3	0.81	10.0	18.2	19.3	24.7	28.3	30.1	34.1	5.31	18.0	23.6	26.1	30.3	33.3	35.5	39.9
De ${\textstyle \rightarrow}$ En 16	BLEU-mb	test	40.2	64	2.06	2.87	3.41	3.63	21.5	17.3	23.0	27.2	0.83	16.2	22.5	24.7	28.2	30.7	33.0	30.4	3.25	22.7	26.2	29.2	32.7	34.8	37.3	40.6
De ${\textstyle \rightarrow}$ En 16	BLEU-sb	test		64	2.39	3.27	3.85	4.04	22.5	18.2	24.4	28.6	0.93	17.1	23.4	25.8	29.2	31.9	34.5	32.1	3.60	23.8	27.5	30.5	34.1	36.5	39.1	43.0
En ${\textstyle \rightarrow}$ De 16	BLEU-mb	test	41.2	64	1.70	2.27	2.31	2.43	12.9	8.66	10.4	24.6	0.50	7.00	12.9	13.1	18.3	20.9	22.5	26.2	3.42	12.3	15.4	17.1	20.9	23.0	26.6	29.7
En ${\textstyle \rightarrow}$ De 16	BLEU-sb	test	41.2	64	2.09	2.65	2.75	2.92	13.7	9.36	11.0	25.3	0.54	7.40	13.4	13.4	18.8	21.7	23.3	27.3	3.78	12.9	16.1	17.7	21.7	24.1	27.7	30.9
Winograd	acc	test	93.8	7	66.3	72.9	74.7	76.9	82.4	85.7	87.9	88.3	63.4	68.5	72.9	76.9	82.4	84.6	86.1	89.7	63.4	67.4	73.6	76.9	84.3	85.4	82.4	88.6
Winogrande	acc	dev	84.6	50	52.0	52.1	57.4	58.7	62.3	64.5	67.9	70.2	51.3	53.0	58.3	59.1	61.7	65.8	66.9	73.2	51.3	52.6	57.5	59.1	62.6	67.4	70.0	77.7
PIQA	acc	dev	77.1	50	64.6	70.2	72.9	75.1	75.6	78.0	78.5	81.0	64.3	69.3	71.8	74.4	74.3	76.3	77.8	80.5	64.3	69.4	72.0	74.3	75.4	77.8	79.9	82.3	82.8
ARC (Challenge)	acc	test	78.5	50	26.6	29.5	31.8	35.5	38.0	41.4	43.7	51.4	25.5	30.2	31.6	36.4	38.4	41.5	43.1	53.2	25.5	28.4	32.3	36.7	39.5	43.7	44.8	51.5
ARC (Easy)	acc	test	92.0	50	43.6	46.5	53.0	53.8	58.2	60.2	63.8	68.8	42.7	48.2	54.6	55.9	60.3	62.6	66.8	71.2	42.7	51.0	58.1	59.1	62.1	65.8	69.1	70.1
OpenBookQA	acc	test	87.2	100	35.6	43.2	45.2	46.8	53.0	50.4	55.6	57.6	37.0	39.8	46.2	46.4	53.4	53.0	55.8	58.8	37.0	43.6	48.0	50.6	55.6	55.2	60.8	65.4
Quac	f1	dev	74.4	5	21.2	26.8	31.0	30.1	34.7	36.1	38.4	41.5	21.1	26.9	31.9	32.3	37.4	39.0	40.6	43.4	21.6	27.6	32.9	34.2	38.2	39.9	40.9	44.3
RACE-h	acc	test	90.0	10	35.2	37.9	40.1	40.9	42.4	44.1	44.6	45.5	34.3	37.7	40.0	42.0	43.8	44.3	44.6	45.9	34.3	37.0	40.4	41.4	42.3	44.7	45.1	46.8
RACE-m	acc	test	93.1	10	42.1	47.2	52.1	52.3	54.7	54.4	56.7	58.4	42.3	47.3	51.7	55.2	56.1	54.7	56.9	57.4	42.3	47.0	52.7	53.0	55.6	55.4	58.1	58.1
SQuADv2	em	dev	90.7	16	22.6	32.8	33.9	43.1	43.6	45.4	49.0	52.6	25.1	37.5	37.9	47.9	47.9	51.1	56.0	60.1	27.5	40.5	39.2	53.5	50.0	56.6	62.6	64.9
SQuADv2	f1	dev	93.0	16	28.3	40.2	41.4	50.3	51.0	52.7	56.3	59.5	30.1	43.6	44.1	54.0	54.1	57.1	61.8	65.4	32.1	45.5	44.9	58.7	55.9	62.1	67.7	69.8
CoQA	f1	dev	90.7	5	34.5	55.0	61.8	65.3	71.1	72.8	76.3	81.5	30.6	52.1	61.6	66.1	71.8	75.1	77.9	84.0	31.1	52.0	62.7	66.8	73.2	77.3	79.9	85.0
DROP	f1	dev	89.1	20	9.40	13.6	14.4	16.4	19.7	17.0	24.0	23.6	11.7	18.1	20.9	23.0	26.4	27.3	29.2	34.3	12.9	18.7	24.0	25.6	29.7	29.7	32.3	36.5
BoolQ	acc	dev	91.0	32	49.7	60.3	58.9	62.4	67.1	65.4	66.2	60.5	52.6	61.7	60.4	63.7	68.4	68.7	69.0	76.7	43.1	60.6	62.0	64.1	70.3	70.0	70.2	77.5	76.4
CB	acc	dev	96.9	32	0.00	32.1	8.93	19.6	19.6	28.6	19.6	46.4	55.4	53.6	53.6	48.2	57.1	33.9	55.4	64.3	42.9	58.9	53.6	69.6	67.9	60.7	66.1	82.1	75.6
CB	f1	dev	93.9	32	0.00	29.3	11.4	17.4	22.4	25.1	20.3	42.8	60.1	39.8	45.6	37.5	45.7	28.5	44.6	52.5	26.1	40.4	32.6	48.3	45.7	44.6	46.0	57.2	52.0
Copa	acc	dev	94.8	32	66.0	68.0	73.0	77.0	76.0	80.0	84.0	91.0	62.0	64.0	66.0	74.0	76.0	82.0	86.0	87.0	67.0	64.0	72.0	77.0	83.0	83.0	86.0	92.0	92.0
RTE	acc	dev	92.5	32	47.7	49.8	48.4	56.0	46.6	55.2	62.8	63.5	53.1	47.3	49.5	49.5	54.9	54.9	56.3	70.4	52.3	48.4	46.9	50.9	56.3	49.5	60.6	72.9	69.0
WiC	acc	dev	76.1	32	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	50.0	50.3	50.3	49.2	49.4	50.3	50.0	48.6	49.8	55.0	53.0	53.0	51.6	53.1	51.1	55.3	49.4
WSC	acc	dev	93.8	32	59.6	56.7	65.4	61.5	66.3	60.6	64.4	65.4	58.7	58.7	60.6	62.5	66.3	60.6	66.3	69.2	58.7	60.6	54.8	49.0	62.5	67.3	75.0	75.0	80.1
MultiRC	acc	dev	62.3	32	4.72	9.65	12.3	13.6	14.3	18.4	24.2	27.6	4.72	9.65	12.3	13.6	14.3	18.4	24.2	27.6	6.09	11.8	16.8	20.8	24.7	23.8	25.0	32.5	30.5
MultiRC	f1a	dev	88.2	32	57.0	59.7	60.4	59.9	60.0	64.5	71.4	72.9	57.0	59.7	60.4	59.9	60.0	64.5	71.4	72.9	45.0	55.9	64.2	65.4	69.5	66.4	69.3	74.8	75.4
ReCoRD	acc	dev	92.5	32	70.8	78.5	82.1	84.1	86.2	88.6	89.0	90.2	69.8	77.0	80.7	83.0	85.9	88.0	88.8	90.2	69.8	77.2	81.3	83.1	86.6	87.9	88.9	89.0	90.2
ReCoRD	f1	dev	93.3	32	71.9	79.2	82.8	85.2	87.3	89.5	90.4	91.0	70.7	77.8	81.6	83.9	86.8	88.8	89.7	91.2	70.7	77.9	82.1	84.0	87.5	88.8	89.8	90.1	91.1
SuperGLUE	average	dev	89.0		40.6	47.4	46.8	49.6	50.1	52.3	54.4	58.2	54.4	55.1	56.7	57.8	61.2	59.7	64.3	68.9	50.2	56.2	56.8	60.0	64.3	63.6	66.9	73.2	71.8
ANLI R1	acc	test	73.8	50	33.4	34.2	33.4	33.4	34.2	32.3	33.2	34.6	32.1	31.6	31.9	34.6	30.6	31.6	32.7	32.0	32.1	32.5	30.9	32.5	33.5	33.1	33.3	36.8
ANLI R2	acc	test	50.7	50	33.2	31.9	33.3	33.3	33.8	33.5	33.5	35.4	35.7	33.7	33.2	32.7	32.7	33.9	33.9	33.9	35.7	33.8	32.1	31.4	32.6	33.3	32.6	34.0
ANLI R3	acc	test	48.3	50	33.6	34.0	33.8	33.4	35.3	34.8	34.4	34.5	35.0	32.6	33.0	33.9	34.1	33.1	32.5	35.1	35.0	34.4	35.1	36.0	32.7	33.9	34.5	40.2
2D+	acc	n/a		50	0.70	0.65	0.70	0.85	1.10	2.54	15.4	76.9	2.00	0.55	3.15	4.00	12.1	19.6	73.0	99.6	2.00	4.10	3.50	4.50	8.90	11.9	55.5	100.0
2D-	acc	n/a		50	1.25	1.25	1.25	1.25	1.60	7.60	12.6	58.0	1.15	0.95	1.45	1.95	3.85	11.5	44.6	86.4	1.15	1.45	2.25	2.70	7.35	13.6	52.4	98.9
3D+	acc	n/a		50	0.10	0.10	0.05	0.10	0.10	0.25	1.40	34.2	0.15	0.00	0.10	0.30	0.45	0.95	15.4	65.5	0.15	0.45	0.30	0.55	0.75	0.90	8.40	80.4
3D-	acc	n/a		50	0.05	0.05	0.05	0.05	0.05	0.45	1.35	48.3	0.05	0.15	0.25	0.30	0.55	1.60	6.15	78.7	0.05	0.10	0.15	0.35	0.65	1.05	9.20	94.2
4D+	acc	n/a		50	0.05	0.05	0.00	0.00	0.05	0.05	0.15	4.00	0.00	0.00	0.10	0.00	0.00	0.10	0.80	14.0	0.00	0.05	0.05	0.00	0.15	0.15	0.40	25.5
4D-	acc	n/a		50	0.00	0.00	0.00	0.00	0.00	0.00	0.10	7.50	0.00	0.00	0.00	0.00	0.05	0.00	0.50	14.0	0.00	0.05	0.00	0.00	0.10	0.05	0.40	26.8
5D+	acc	n/a		50	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.65	0.00	0.00	0.00	0.00	0.00	0.00	0.05	3.45	0.00	0.00	0.00	0.00	0.00	0.00	0.05	9.30
5D-	acc	n/a		50	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.80	0.00	0.00	0.00	0.00	0.00	0.00	0.05	3.75	0.00	0.00	0.00	0.00	0.00	0.00	0.00	9.90
2Dx	acc	n/a		50	2.20	2.25	2.65	2.10	2.55	5.80	6.15	19.8	1.35	2.35	3.35	2.35	4.75	9.15	11.0	27.4	1.35	2.90	2.70	2.85	4.25	6.10	7.05	29.2
1DC	acc	n/a		50	1.25	2.95	2.75	0.05	0.30	2.35	0.75	9.75	1.90	2.80	2.85	3.65	6.45	9.15	8.20	14.3	1.70	2.15	3.90	5.75	6.20	7.60	9.95	21.3
Cycled Letters	acc	n/a		100	0.62	0.71	2.85	0.00	0.63	1.35	2.58	3.66	1.67	4.36	5.68	6.46	6.25	9.41	15.1	21.7	4.63	9.27	10.7	14.5	16.7	21.9	27.7	37.9
Anagrams 1	acc	n/a		100	0.10	0.14	0.40	0.00	0.27	0.69	1.16	2.28	0.21	0.61	1.12	1.27	1.60	2.72	3.72	8.62	0.50	1.27	2.13	3.05	3.81	5.49	8.38	15.1
Anagrams 2	acc	n/a		100	0.81	1.21	2.69	0.01	1.71	3.75	4.53	8.91	1.19	2.62	4.70	4.77	6.97	10.2	14.6	25.9	1.94	4.80	7.59	9.87	12.6	18.9	25.6	39.7
Symbol Insertion	acc	n/a		100	0.00	0.00	0.10	0.00	0.05	0.42	0.89	8.26	0.03	0.05	0.57	1.18	1.67	3.46	6.62	45.4	0.11	0.28	2.19	4.18	6.61	11.0	27.3	67.2
Reversed Words	acc	n/a		100	0.00	0.01	0.01	0.01	0.02	0.03	0.03	0.09	0.02	0.01	0.01	0.00	0.05	0.07	0.11	0.48	0.00	0.05	0.00	0.17	0.24	0.30	0.42	0.44
SAT Analogies	acc	n/a		20	35.6	39.0	45.2	44.1	50.0	49.2	52.7	53.7	30.5	41.2	43.1	46.5	55.1	54.3	53.5	59.1	30.5	40.4	42.8	40.6	48.4	51.9	53.5	65.2

References

ADG⁺ [16] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981–3989, 2016.
AI [19] WeChat AI. Tr-mt (ensemble), December 2019.
AJF [19] Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
BBDIW [20] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is power: A critical survey of “bias” in nlp. arXiv preprint arXiv:2005.14050, 2020.
BCFL [13] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544, 2013.
BDD⁺ [09] Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. The fifth PASCAL recognizing textual entailment challenge. 2009.
BES [10] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In Lrec, volume 10, pages 2200–2204, 2010.
BHDD⁺ [06] Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second PASCAL recognising textual entailment challenge. 2006.
BHT⁺ [20] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, et al. Experience grounds language. arXiv preprint arXiv:2004.10151, 2020.
BLC [13] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. Arxiv, 2013.
BZB⁺ [19] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641, 2019.
Car [97] Rich Caruana. Multitask learning. Machine learning, 28(1), 1997.
CB [78] Susan Carey and Elsa Bartlett. Acquiring a single new word. Proceedings of the Stanford Child Language Conference, 1978.
CCE⁺ [18] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
CGRS [19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019.
CHI⁺ [18] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. Arxiv, 2018.
CLC⁺ [19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
CLY⁺ [19] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
Cra [17] Kate Crawford. The trouble with bias. NIPS 2017 Keynote, 2017.
DCLT [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
DGM [06] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising textual entailment, pages 177–190. Springer, 2006.
DGV⁺ [18] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. Arxiv, 2018.
DHKH [14] Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. Edinburgh’s phrase-based machine translation systems for wmt-14. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 97–104, 2014.
DL [15] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in neural information processing systems, 2015.
DMST [19] Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.
DSC⁺ [16] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl²: Fast reinforcement learning via slow reinforcement learning. ArXiv, abs/1611.02779, 2016.
DWD⁺ [19] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161, 2019.
DYY⁺ [19] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. Arxiv, 2019.
EOAG [18] Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381, 2018.
FAL [17] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. ArXiv, abs/1703.03400, 2017.
Fyo [00] Yaroslav Fyodorov. A natural logic inference system, 2000.
GG [19] Hila Gonen and Yoav Goldberg. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862, 2019.
GLT⁺ [20] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.
GMDD [07] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics, 2007.
Gra [16] Alex Graves. Adaptive computation time for recurrent neural networks. Arxiv, 2016.
GSL⁺ [18] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, 2018.
GSR [19] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. Gltr: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019.
GWC⁺ [18] Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437, 2018.
HB [20] Daniel Hernandez and Tom Brown. Ai and efficiency, May 2020.
HBFC [19] Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. CoRR, abs/1904.09751, 2019.
HLW⁺ [20] Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. Pretrained transformers improve out of distribution robustness. arXiv preprint arXiv:2004.06100, 2020.
HNA⁺ [17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
HR [18] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
HVD [15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
HYC [01] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to Learn Using Gradient Descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
HZJ⁺ [19] Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. Reducing sentiment bias in language models via counterfactual evaluation. arXiv preprint arXiv:1911.03064, 2019.
IBGC⁺ [14] Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. A neural network for factoid question answering over paragraphs. In Empirical Methods in Natural Language Processing, 2014.
IDCBE [19] Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. Automatic detection of generated text is easiest when humans are fooled. arXiv preprint arXiv:1911.00650, 2019.
JCWZ [17] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
JN [20] Zheng Junyuan and Gamma Lab NYC. Numeric transformer - albert, March 2020.
JVS⁺ [16] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
JYS⁺ [19] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
JZC⁺ [19] Ying Ju, Fubang Zhao, Shijie Chen, Bowen Zheng, Xuefeng Yang, and Yunfeng Liu. Technical report on conversational question answering. arXiv preprint arXiv:1909.10772, 2019.
KCR⁺ [18] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
KKS⁺ [20] Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700, 2020.
KMB [20] Sarah E. Kreps, Miles McCain, and Miles Brundage. All the news that’s fit to fabricate: Ai-generated text as a tool of media misinformation, 2020.
KMH⁺ [20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
KPR⁺ [19] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.
KR [16] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. Arxiv, 2016.
LB [02] Edward Loper and Steven Bird. Nltk: The natural language toolkit, 2002.
LC [19] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
LCG⁺ [19] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
LCH⁺ [20] Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020.
LDL [19] Zhongyang Li, Xiao Ding, and Ting Liu. Story ending prediction by transferable bert. arXiv preprint arXiv:1905.07504, 2019.
LDM [12] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
LGG⁺ [20] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210, 2020.
LGH⁺ [15] Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
LH [17] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
LHCG [19a] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482, 2019.
LHCG [19b] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.
Lin [20] Tal Linzen. How can we accelerate progress towards human-like linguistic generalization? arXiv preprint arXiv:2005.00955, 2020.
LLG⁺ [19] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
LM [17] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.
LOG⁺ [19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
LPP⁺ [20] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401, 2020.
LSP⁺ [18] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.
LWS⁺ [20] Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. Train large, then compress: Rethinking model size for efficient training and inference of transformers, 2020.
LXL⁺ [17] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
LYN⁺ [20] Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, and Jimmy Lin. Tttttackling winogrande schemas. arXiv preprint arXiv:2003.08380, 2020.
Mac [92] David MacKay. Information-based objective functions for active data selection. Neural Computation, 1992.
MBXS [17] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.
MCCD [13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
MCH⁺ [16] Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696, 2016.
MCKS [18] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. ArXiv, abs/1809.02789, 2018.
MKAT [18] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training, 2018.
MKM⁺ [94] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank: annotating predicate argument structure. In Proceedings of the workshop on Human Language Technology, pages 114–119. Association for Computational Linguistics, 1994.
MKXS [18] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
MPL [19] R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007, 2019.
MWZ⁺ [18] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting, 2018.
NBR [20] Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456, 2020.
NK [19] Timothy Niven and Hung-Yu Kao. Probing neural network comprehension of natural language arguments. arXiv preprint arXiv:1907.07355, 2019.
Nor [09] Peter Norvig. Natural language corpus data, 2009.
NvNvdG [19] Malvina Nissim, Rik van Noord, and Rob van der Goot. Fair is better than sensational: Man is to doctor as woman is to doctor. arXiv preprint arXiv:1905.09866, 2019.
NWD⁺ [19] Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599, 2019.
oR [16] University of Regensburg. Fascha, 2016.
PCC [18] Mohammad Taher Pilehvar and Jose Camacho-Collados. WIC: 10,000 example pairs for evaluating context-sensitive representations. arXiv preprint arXiv:1808.09121, 2018.
PFB [18] Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.
PHR⁺ [18] Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of EMNLP, 2018.
PKL⁺ [16] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
PNZtY [18] Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. Dissecting contextual word embeddings: Architecture and representation, 2018.
Pos [18] Matt Post. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771, 2018.
PSM [14] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014.
QIA [20] QIANXIN. Sa-net on albert (ensemble), April 2020.
QMZH [19] Yusu Qian, Urwa Muaz, Ben Zhang, and Jae Won Hyun. Reducing gender bias in word-level language models with a gender-equalizing loss function. arXiv preprint arXiv:1905.12801, 2019.
RBG [11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.
RCM [19] Siva Reddy, Danqi Chen, and Christopher D Manning. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019.
RCP⁺ [17] Scott Reed, Yutian Chen, Thomas Paine, Aäron van den Oord, SM Eslami, Danilo Rezende, Oriol Vinyals, and Nando de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. arXiv preprint arXiv:1710.10304, 2017.
RJL [18] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
RL [16] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. ICLR 2017 (oral), 2016.
RLL⁺ [19] Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of EMNLP, 2019.
RNLVD [18] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301, 2018.
RNSS [18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018.
Ros [12] R.S. Ross. Guide for conducting risk assessments. NIST Special Publication, 2012.
RRBS [19] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019.
RRS [20] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020.
RSR⁺ [19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019.
RWC⁺ [19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019.
SBBC [19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019.
SBC⁺ [19] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse, Jason Blazakis, Kris McGuffie, and Jasmine Wang. Release strategies and the social impacts of language models, 2019.
SCNP [19] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326, 2019.
SDCW [19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
SDSE [19] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. CoRR, abs/1907.10597, 2019.
SHB [15] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709, 2015.
SMM⁺ [17] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
SPP⁺ [19] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019.
SS [20] Timo Schick and Hinrich Schütze. Exploiting cloze questions for few-shot text classification and natural language inference. arXiv preprint arXiv:2001.07676, 2020.
STQ⁺ [19] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450, 2019.
TFR⁺ [17] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
TL [05] Peter D. Turney and Michael L. Littman. Corpus-based learning of analogies and semantic relations. CoRR, abs/cs/0508103, 2005.
TL [18] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.
TLBS [03] Peter D. Turney, Michael L. Littman, Jeffrey Bigham, and Victor Shnayder. Combining independent modules to solve multiple-choice synonym and analogy problems. CoRR, cs.CL/0309035, 2003.
Tur [20] Project Turing. Microsoft research blog, Feb 2020.
VBL⁺ [16] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching Networks for One Shot Learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
VSP⁺ [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, 2017.
WPN⁺ [19] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261–3275, 2019.
WXH⁺ [18] Yiren Wang, Yingce Xia, Tianyu He, Fei Tian, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Multi-agent dual learning. ICLR 2019, 2018.
XDH⁺ [19] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training, 2019.
YdC⁺ [19] Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373, 2019.
YDY⁺ [19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
ZHB⁺ [19] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
ZHR⁺ [19] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. arXiv preprint arXiv:1905.12616, 2019.
ZLL⁺ [18] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885, 2018.
[143] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2019.
[144] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. ArXiv, abs/1909.08593, 2019.