<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Language_Modeling_with_Gated_Convolutional_Networks</id>
	<title>Language Modeling with Gated Convolutional Networks - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Language_Modeling_with_Gated_Convolutional_Networks"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Language_Modeling_with_Gated_Convolutional_Networks&amp;action=history"/>
	<updated>2026-04-27T14:37:01Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Language_Modeling_with_Gated_Convolutional_Networks&amp;diff=8720&amp;oldid=prev</id>
		<title>DeployBot: Marked this version for translation</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Language_Modeling_with_Gated_Convolutional_Networks&amp;diff=8720&amp;oldid=prev"/>
		<updated>2026-04-27T06:50:12Z</updated>

		<summary type="html">&lt;p&gt;Marked this version for translation&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:50, 27 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l20&quot;&gt;Line 20:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 20:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Language Modeling with Gated Convolutional Networks&amp;#039;&amp;#039;&amp;#039; is a 2016 paper by Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier of Facebook AI Research that introduces the gated convolutional neural network (GCNN) for language modeling and the gated linear unit (GLU) activation. The paper challenges the prevailing assumption that recurrent networks are necessary to achieve state-of-the-art perplexity on large-scale language modeling benchmarks, showing that a finite-context, parallelizable convolutional stack equipped with multiplicative gating can match or exceed strong LSTM baselines while running an order of magnitude faster at inference. It was published at ICML 2017.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Language Modeling with Gated Convolutional Networks&amp;#039;&amp;#039;&amp;#039; is a 2016 paper by Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier of Facebook AI Research that introduces the gated convolutional neural network (GCNN) for language modeling and the gated linear unit (GLU) activation. The paper challenges the prevailing assumption that recurrent networks are necessary to achieve state-of-the-art perplexity on large-scale language modeling benchmarks, showing that a finite-context, parallelizable convolutional stack equipped with multiplicative gating can match or exceed strong LSTM baselines while running an order of magnitude faster at inference. It was published at ICML 2017.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:2--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Overview == &lt;/ins&gt;&amp;lt;!--T:2--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Overview ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:3--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:3--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l29&quot;&gt;Line 29:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 28:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The contribution is twofold. First, an architecture: stacked causal 1-D convolutions arranged into pre-activation residual bottleneck blocks and capped with an adaptive softmax output layer. Second, an activation function: the gated linear unit, which preserves a linear gradient path while retaining gating&amp;#039;s ability to modulate information flow. The combination yields convergence and final perplexity that compare favourably with strongly tuned LSTM baselines, and it does so with markedly better inference latency.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The contribution is twofold. First, an architecture: stacked causal 1-D convolutions arranged into pre-activation residual bottleneck blocks and capped with an adaptive softmax output layer. Second, an activation function: the gated linear unit, which preserves a linear gradient path while retaining gating&amp;#039;s ability to modulate information flow. The combination yields convergence and final perplexity that compare favourably with strongly tuned LSTM baselines, and it does so with markedly better inference latency.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:5--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Key Contributions == &lt;/ins&gt;&amp;lt;!--T:5--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Key Contributions ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:6--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:6--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l40&quot;&gt;Line 40:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 38:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Provides a controlled empirical comparison of gating mechanisms, showing GLU outperforms the tanh-based GTU of van den Oord et al. (2016), as well as plain ReLU and Tanh networks.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Provides a controlled empirical comparison of gating mechanisms, showing GLU outperforms the tanh-based GTU of van den Oord et al. (2016), as well as plain ReLU and Tanh networks.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:7--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Methods == &lt;/ins&gt;&amp;lt;!--T:7--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Methods ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:8--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:8--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l64&quot;&gt;Line 64:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 61:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Architecturally each block contains up to five layers in a pre-activation bottleneck pattern (a wider &amp;lt;math&amp;gt;k&amp;gt;1&amp;lt;/math&amp;gt; convolution sandwiched between two &amp;lt;math&amp;gt;k=1&amp;lt;/math&amp;gt; projections) and is wrapped in a residual addition. Models in the paper range from 8 to 14 blocks, with hidden widths of 800–2048 and embedding sizes of 128–280. Training uses Nesterov momentum with momentum 0.99, gradient clipping at 0.1, weight normalization, Kaiming initialization, and learning rates between 1.0 and 2.0. The use of gradient clipping — usually motivated by recurrent gradient explosion — is justified here from a trust-region perspective and substantially speeds up training. Implementation is in Torch on Tesla M40 GPUs, with the largest models trained on 8 GPUs via synchronous data-parallel SGD.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Architecturally each block contains up to five layers in a pre-activation bottleneck pattern (a wider &amp;lt;math&amp;gt;k&amp;gt;1&amp;lt;/math&amp;gt; convolution sandwiched between two &amp;lt;math&amp;gt;k=1&amp;lt;/math&amp;gt; projections) and is wrapped in a residual addition. Models in the paper range from 8 to 14 blocks, with hidden widths of 800–2048 and embedding sizes of 128–280. Training uses Nesterov momentum with momentum 0.99, gradient clipping at 0.1, weight normalization, Kaiming initialization, and learning rates between 1.0 and 2.0. The use of gradient clipping — usually motivated by recurrent gradient explosion — is justified here from a trust-region perspective and substantially speeds up training. Implementation is in Torch on Tesla M40 GPUs, with the largest models trained on 8 GPUs via synchronous data-parallel SGD.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:15--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Results == &lt;/ins&gt;&amp;lt;!--T:15--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Results ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:16--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:16--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l73&quot;&gt;Line 73:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 69:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For computational efficiency, a GCNN-8 Bottleneck matches the throughput of a heavily cuDNN-optimized LSTM-2048 (about 45,800 tokens/s on GPU) at the same 43.9 perplexity operating point, while delivering 20× better responsiveness (per-token sequential latency) because each token can be evaluated independently rather than waiting on a recurrent hidden state. The ablation over gating mechanisms shows GLU converges faster and to lower perplexity than GTU, ReLU, or Tanh, and that learning-curve gaps between gated and ungated variants are large and consistent across both datasets. Increasing the receptive field beyond roughly 20 tokens of context yields diminishing returns, supporting the claim that finite contexts suffice for the bulk of practical language modeling.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For computational efficiency, a GCNN-8 Bottleneck matches the throughput of a heavily cuDNN-optimized LSTM-2048 (about 45,800 tokens/s on GPU) at the same 43.9 perplexity operating point, while delivering 20× better responsiveness (per-token sequential latency) because each token can be evaluated independently rather than waiting on a recurrent hidden state. The ablation over gating mechanisms shows GLU converges faster and to lower perplexity than GTU, ReLU, or Tanh, and that learning-curve gaps between gated and ungated variants are large and consistent across both datasets. Increasing the receptive field beyond roughly 20 tokens of context yields diminishing returns, supporting the claim that finite contexts suffice for the bulk of practical language modeling.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:18--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Impact == &lt;/ins&gt;&amp;lt;!--T:18--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Impact ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:19--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:19--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l85&quot;&gt;Line 85:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 80:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A secondary methodological legacy is the paper&amp;#039;s clean separation of &amp;#039;&amp;#039;&amp;#039;throughput&amp;#039;&amp;#039;&amp;#039; (tokens/second under batching) from &amp;#039;&amp;#039;&amp;#039;responsiveness&amp;#039;&amp;#039;&amp;#039; (per-token sequential latency). This distinction has since become standard when evaluating sequence models for production deployment, where a model with the same training-time throughput as a baseline may nevertheless be unusable if it cannot decode a single sequence quickly enough.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A secondary methodological legacy is the paper&amp;#039;s clean separation of &amp;#039;&amp;#039;&amp;#039;throughput&amp;#039;&amp;#039;&amp;#039; (tokens/second under batching) from &amp;#039;&amp;#039;&amp;#039;responsiveness&amp;#039;&amp;#039;&amp;#039; (per-token sequential latency). This distinction has since become standard when evaluating sequence models for production deployment, where a model with the same training-time throughput as a baseline may nevertheless be unusable if it cannot decode a single sequence quickly enough.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:22--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== See also == &lt;/ins&gt;&amp;lt;!--T:22--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== See also ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:23--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:23--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l97&quot;&gt;Line 97:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 91:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[WikiText-103]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[WikiText-103]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:24--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References == &lt;/ins&gt;&amp;lt;!--T:24--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:25--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:25--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Language_Modeling_with_Gated_Convolutional_Networks&amp;diff=8719&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Claude-authored from arxiv:1612.08083</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Language_Modeling_with_Gated_Convolutional_Networks&amp;diff=8719&amp;oldid=prev"/>
		<updated>2026-04-27T06:50:12Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Claude-authored from arxiv:1612.08083&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;br /&gt;
{{PaperTabs}}&lt;br /&gt;
{{PaperInfobox&lt;br /&gt;
 | topic_area  = NLP&lt;br /&gt;
 | difficulty  = Research&lt;br /&gt;
 | authors     = Yann N. Dauphin; Angela Fan; Michael Auli; David Grangier&lt;br /&gt;
 | year        = 2016&lt;br /&gt;
 | arxiv_id    = 1612.08083&lt;br /&gt;
 | source_url  = https://arxiv.org/abs/1612.08083&lt;br /&gt;
 | pdf_url     = https://arxiv.org/pdf/1612.08083.pdf&lt;br /&gt;
}}&lt;br /&gt;
{{ContentMeta&lt;br /&gt;
 | generated_by   = claude-code-direct&lt;br /&gt;
 | model_used     = claude-opus-4-7&lt;br /&gt;
 | generated_date = 2026-04-27&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;translate&amp;gt;&lt;br /&gt;
&amp;lt;!--T:1--&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Language Modeling with Gated Convolutional Networks&amp;#039;&amp;#039;&amp;#039; is a 2016 paper by Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier of Facebook AI Research that introduces the gated convolutional neural network (GCNN) for language modeling and the gated linear unit (GLU) activation. The paper challenges the prevailing assumption that recurrent networks are necessary to achieve state-of-the-art perplexity on large-scale language modeling benchmarks, showing that a finite-context, parallelizable convolutional stack equipped with multiplicative gating can match or exceed strong LSTM baselines while running an order of magnitude faster at inference. It was published at ICML 2017.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:2--&amp;gt;&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:3--&amp;gt;&lt;br /&gt;
Statistical language models estimate the probability of a token sequence by factoring it into a product of next-word conditionals. Until 2016 the dominant neural approach used recurrent networks — typically LSTMs — whose strength was attributed to their unbounded effective context. The authors argue that this property is not strictly required: a sufficiently deep convolutional stack can represent contexts large enough for practical language modeling, and the absence of a temporal recurrence makes such a model trivially parallel over the time dimension.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:4--&amp;gt;&lt;br /&gt;
The contribution is twofold. First, an architecture: stacked causal 1-D convolutions arranged into pre-activation residual bottleneck blocks and capped with an adaptive softmax output layer. Second, an activation function: the gated linear unit, which preserves a linear gradient path while retaining gating&amp;#039;s ability to modulate information flow. The combination yields convergence and final perplexity that compare favourably with strongly tuned LSTM baselines, and it does so with markedly better inference latency.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:5--&amp;gt;&lt;br /&gt;
== Key Contributions ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:6--&amp;gt;&lt;br /&gt;
* Proposes the &amp;#039;&amp;#039;&amp;#039;Gated Linear Unit (GLU)&amp;#039;&amp;#039;&amp;#039;, a multiplicative activation in which one linear convolution is gated by the sigmoid of a parallel convolution, yielding a non-vanishing linear gradient path.&lt;br /&gt;
* Introduces the &amp;#039;&amp;#039;&amp;#039;Gated Convolutional Network (GCNN)&amp;#039;&amp;#039;&amp;#039; — stacked causal 1-D convolutions with residual bottleneck blocks and adaptive softmax — as the first non-recurrent model competitive with LSTMs on large-scale language modeling.&lt;br /&gt;
* Achieves a new single-model state of the art of 37.2 perplexity on &amp;#039;&amp;#039;&amp;#039;WikiText-103&amp;#039;&amp;#039;&amp;#039;, surpassing the previous LSTM-1024 baseline at 48.7.&lt;br /&gt;
* Establishes a new best single-GPU result on the &amp;#039;&amp;#039;&amp;#039;Google Billion Word&amp;#039;&amp;#039;&amp;#039; benchmark and reaches 31.9 perplexity with an 8-GPU configuration trained for 2 weeks (versus 30.6 perplexity for the LSTM of Jozefowicz et al. trained for 3 weeks on 32 GPUs).&lt;br /&gt;
* Demonstrates a &amp;#039;&amp;#039;&amp;#039;20× improvement in responsiveness&amp;#039;&amp;#039;&amp;#039; (sequential per-token latency) over a comparable LSTM, by exploiting the parallel structure of convolutions.&lt;br /&gt;
* Provides a controlled empirical comparison of gating mechanisms, showing GLU outperforms the tanh-based GTU of van den Oord et al. (2016), as well as plain ReLU and Tanh networks.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:7--&amp;gt;&lt;br /&gt;
== Methods ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:8--&amp;gt;&lt;br /&gt;
The model takes a sequence of word embeddings &amp;lt;math&amp;gt;\mathbf{E} = [\mathbf{D}_{w_0}, \ldots, \mathbf{D}_{w_N}]&amp;lt;/math&amp;gt; and feeds it through a stack of residual blocks, each computing the gated linear unit&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:9--&amp;gt;&lt;br /&gt;
:&amp;lt;math&amp;gt;h_l(\mathbf{X}) = (\mathbf{X} \ast \mathbf{W} + \mathbf{b}) \otimes \sigma(\mathbf{X} \ast \mathbf{V} + \mathbf{c})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:10--&amp;gt;&lt;br /&gt;
where &amp;lt;math&amp;gt;\ast&amp;lt;/math&amp;gt; is 1-D convolution along the time axis, &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is the sigmoid, and &amp;lt;math&amp;gt;\otimes&amp;lt;/math&amp;gt; is element-wise multiplication. Causality is enforced by zero-padding the left of the input by &amp;lt;math&amp;gt;k-1&amp;lt;/math&amp;gt; positions so that the kernel never sees future tokens. The output of the stack is fed to an &amp;#039;&amp;#039;&amp;#039;adaptive softmax&amp;#039;&amp;#039;&amp;#039; that allocates more capacity to frequent words, dramatically reducing the cost of the output distribution for vocabularies of hundreds of thousands of types.&lt;br /&gt;
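The gated layer and its causal left-padding can be sketched in NumPy (a hypothetical minimal implementation, not the original Torch code; the function name and array shapes are illustrative):&lt;br /&gt;

```python
import numpy as np

def glu_causal_conv(x, W, b, V, c):
    # Minimal sketch of one gated convolutional layer (GLU).
    # x: (T, d_in) sequence; W, V: (k, d_in, d_out) kernels; b, c: (d_out,).
    # Left-padding with k-1 zero rows keeps the layer causal: the output
    # at position t depends only on positions up to and including t.
    k = W.shape[0]
    xp = np.vstack([np.zeros((k - 1, x.shape[1])), x])
    T = x.shape[0]
    a = np.stack([np.tensordot(xp[t:t + k], W, axes=([0, 1], [0, 1])) + b
                  for t in range(T)])            # X * W + b (linear path)
    g = np.stack([np.tensordot(xp[t:t + k], V, axes=([0, 1], [0, 1])) + c
                  for t in range(T)])            # X * V + c (gate path)
    return a * (1.0 / (1.0 + np.exp(-g)))        # elementwise sigmoid gating
```

Perturbing a future token leaves all earlier outputs unchanged, which is exactly the causality property the zero-padding is meant to enforce.&lt;br /&gt;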
&lt;br /&gt;
&amp;lt;!--T:11--&amp;gt;&lt;br /&gt;
The choice of activation is the central methodological contribution. The gradient of the GLU,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:12--&amp;gt;&lt;br /&gt;
:&amp;lt;math&amp;gt;\nabla[\mathbf{X} \otimes \sigma(\mathbf{X})] = \nabla\mathbf{X} \otimes \sigma(\mathbf{X}) + \mathbf{X} \otimes \sigma&amp;#039;(\mathbf{X})\nabla\mathbf{X}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:13--&amp;gt;&lt;br /&gt;
contains an undamped term &amp;lt;math&amp;gt;\nabla\mathbf{X} \otimes \sigma(\mathbf{X})&amp;lt;/math&amp;gt; for active gating units, in contrast to the LSTM-style gated tanh unit (GTU) whose gradient is scaled by both &amp;lt;math&amp;gt;\tanh&amp;#039;&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\sigma&amp;#039;&amp;lt;/math&amp;gt; and therefore vanishes more rapidly with depth. The authors describe the GLU as a multiplicative skip connection: the network can still multiplicatively gate information flow, but the linear path keeps gradients well-conditioned in deep stacks.&lt;br /&gt;
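This vanishing-gradient argument can be checked numerically: the two derivative formulas below follow directly from the chain rule, evaluated away from zero where the saturating activations flatten out.&lt;br /&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu_grad(z):
    # d/dz [z * sigmoid(z)]: the first term, sigmoid(z), is undamped
    # and approaches 1 for active units, so a linear gradient path survives.
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

def gtu_grad(z):
    # d/dz [tanh(z) * sigmoid(z)]: every term carries a saturating
    # derivative factor, so the gradient vanishes for large activations.
    s = sigmoid(z)
    t = np.tanh(z)
    return (1.0 - t * t) * s + t * s * (1.0 - s)
```

At z = 5 the GLU derivative is still close to 1 while the GTU derivative has collapsed below 0.01, illustrating why the linear path keeps deep stacks well-conditioned.&lt;br /&gt;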
&lt;br /&gt;
&amp;lt;!--T:14--&amp;gt;&lt;br /&gt;
Architecturally each block contains up to five layers in a pre-activation bottleneck pattern (a wider &amp;lt;math&amp;gt;k&amp;gt;1&amp;lt;/math&amp;gt; convolution sandwiched between two &amp;lt;math&amp;gt;k=1&amp;lt;/math&amp;gt; projections) and is wrapped in a residual addition. Models in the paper range from 8 to 14 blocks, with hidden widths of 800–2048 and embedding sizes of 128–280. Training uses Nesterov momentum with momentum 0.99, gradient clipping at 0.1, weight normalization, Kaiming initialization, and learning rates between 1.0 and 2.0. The use of gradient clipping — usually motivated by recurrent gradient explosion — is justified here from a trust-region perspective and substantially speeds up training. Implementation is in Torch on Tesla M40 GPUs, with the largest models trained on 8 GPUs via synchronous data-parallel SGD.&lt;br /&gt;
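The optimizer recipe maps onto a compact update rule. The following is a speculative sketch assuming a PyTorch-style Nesterov formulation, with the momentum and clipping constants taken from the text; the exact update form in the original Torch code may differ.&lt;br /&gt;

```python
import numpy as np

def clipped_nesterov_step(w, v, grad, lr=1.0, momentum=0.99, clip=0.1):
    # Hedged sketch of the reported recipe: clip the gradient L2 norm to
    # 0.1 (the trust-region-style step bound), then apply Nesterov momentum.
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)
    v = momentum * v + grad
    w = w - lr * (grad + momentum * v)   # Nesterov lookahead update
    return w, v
```

Clipping the norm rather than individual components preserves the gradient direction while bounding the step size, which is the trust-region reading the authors give for why clipping helps even without recurrent gradient explosion.&lt;br /&gt;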
&lt;br /&gt;
&amp;lt;!--T:15--&amp;gt;&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:16--&amp;gt;&lt;br /&gt;
On the &amp;#039;&amp;#039;&amp;#039;Google Billion Word&amp;#039;&amp;#039;&amp;#039; benchmark, the GCNN-13 reaches 38.1 test perplexity on a single GPU, beating the comparable LSTM-2048 at 39.8. Scaled to 8 GPUs, the GCNN-14 Bottleneck reaches 31.9 perplexity, approaching the 30.6 of the much larger 2-layer LSTM-8192-1024 of Jozefowicz et al. while requiring roughly one third of the GPU-time. On &amp;#039;&amp;#039;&amp;#039;WikiText-103&amp;#039;&amp;#039;&amp;#039;, whose entries are full Wikipedia articles averaging about 4,000 tokens, the GCNN-14 achieves 37.2 perplexity, a substantial improvement over the LSTM-1024 baseline at 48.7 and the first non-recurrent state of the art on this benchmark. The model also reaches 29.4 perplexity on &amp;#039;&amp;#039;&amp;#039;Gigaword&amp;#039;&amp;#039;&amp;#039; (versus 55.6 for a fully connected baseline) but underperforms on the small &amp;#039;&amp;#039;&amp;#039;Penn Treebank&amp;#039;&amp;#039;&amp;#039;, where the authors observe overfitting and conclude the architecture is better suited to large-scale problems.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:17--&amp;gt;&lt;br /&gt;
For computational efficiency, a GCNN-8 Bottleneck matches the throughput of a heavily cuDNN-optimized LSTM-2048 (about 45,800 tokens/s on GPU) at the same 43.9 perplexity operating point, while delivering 20× better responsiveness (per-token sequential latency) because each token can be evaluated independently rather than waiting on a recurrent hidden state. The ablation over gating mechanisms shows GLU converges faster and to lower perplexity than GTU, ReLU, or Tanh, and that learning-curve gaps between gated and ungated variants are large and consistent across both datasets. Increasing the receptive field beyond roughly 20 tokens of context yields diminishing returns, supporting the claim that finite contexts suffice for the bulk of practical language modeling.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:18--&amp;gt;&lt;br /&gt;
== Impact ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:19--&amp;gt;&lt;br /&gt;
The paper is a foundational reference for the broader shift away from purely recurrent sequence models. The gated linear unit it introduced is now a standard activation: GLU and its variants — particularly SwiGLU and GeGLU from the family analysed by Shazeer (2020) — are used in the feed-forward sublayers of large language models such as PaLM, LLaMA, and many open-source transformers, where they consistently outperform plain ReLU or GeLU on perplexity at fixed parameter count.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:20--&amp;gt;&lt;br /&gt;
The architectural argument that parallelizable, finite-context models can rival recurrent ones also helped clear the conceptual ground for the convolutional sequence-to-sequence work of Gehring et al. (2017) at the same lab and ultimately for the &amp;#039;&amp;#039;&amp;#039;Transformer&amp;#039;&amp;#039;&amp;#039; (Vaswani et al., 2017), which replaced both convolution and recurrence with self-attention while inheriting the parallelism argument and (in many later variants) the GLU activation. Within speech and translation pipelines the latency advantage was directly exploited by subsequent convolutional and gated systems before attention-based decoders became dominant.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:21--&amp;gt;&lt;br /&gt;
A secondary methodological legacy is the paper&amp;#039;s clean separation of &amp;#039;&amp;#039;&amp;#039;throughput&amp;#039;&amp;#039;&amp;#039; (tokens/second under batching) from &amp;#039;&amp;#039;&amp;#039;responsiveness&amp;#039;&amp;#039;&amp;#039; (per-token sequential latency). This distinction has since become standard when evaluating sequence models for production deployment, where a model with the same training-time throughput as a baseline may nevertheless be unusable if it cannot decode a single sequence quickly enough.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:22--&amp;gt;&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:23--&amp;gt;&lt;br /&gt;
* [[Long short-term memory]]&lt;br /&gt;
* [[Convolutional neural network]]&lt;br /&gt;
* [[Recurrent neural network]]&lt;br /&gt;
* [[Language model]]&lt;br /&gt;
* [[Transformer (machine learning model)]]&lt;br /&gt;
* [[Attention Is All You Need]]&lt;br /&gt;
* [[WikiText-103]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:24--&amp;gt;&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:25--&amp;gt;&lt;br /&gt;
* Dauphin, Y. N.; Fan, A.; Auli, M.; Grangier, D. (2017). &amp;quot;Language Modeling with Gated Convolutional Networks&amp;quot;. &amp;#039;&amp;#039;Proceedings of the 34th International Conference on Machine Learning&amp;#039;&amp;#039; (ICML).&lt;br /&gt;
* Jozefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; Wu, Y. (2016). &amp;quot;Exploring the Limits of Language Modeling&amp;quot;.&lt;br /&gt;
* Chelba, C. &amp;#039;&amp;#039;et al.&amp;#039;&amp;#039; (2013). &amp;quot;One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling&amp;quot;.&lt;br /&gt;
* Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. (2016). &amp;quot;Pointer Sentinel Mixture Models&amp;quot; (introduces the WikiText-103 corpus).&lt;br /&gt;
* van den Oord, A. &amp;#039;&amp;#039;et al.&amp;#039;&amp;#039; (2016). &amp;quot;Conditional Image Generation with PixelCNN Decoders&amp;quot; (the LSTM-style gated tanh unit baseline).&lt;br /&gt;
* Grave, E.; Joulin, A.; Cissé, M.; Grangier, D.; Jégou, H. (2017). &amp;quot;Efficient Softmax Approximation for GPUs&amp;quot; (adaptive softmax).&lt;br /&gt;
* Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y. N. (2017). &amp;quot;Convolutional Sequence to Sequence Learning&amp;quot;.&lt;br /&gt;
* Vaswani, A. &amp;#039;&amp;#039;et al.&amp;#039;&amp;#039; (2017). &amp;quot;Attention Is All You Need&amp;quot;.&lt;br /&gt;
* Shazeer, N. (2020). &amp;quot;GLU Variants Improve Transformer&amp;quot; (later analysis of GLU-family activations in transformer feed-forward layers).&lt;br /&gt;
* Hochreiter, S.; Schmidhuber, J. (1997). &amp;quot;Long Short-Term Memory&amp;quot;. &amp;#039;&amp;#039;Neural Computation&amp;#039;&amp;#039; 9(8): 1735–1780.&lt;br /&gt;
&amp;lt;/translate&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:NLP]]&lt;br /&gt;
[[Category:Research]]&lt;br /&gt;
[[Category:Research Papers]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
</feed>