GPT-3 uses the same architecture as GPT-2 — a decoder-only transformer with pre-normalization — but scaled to 175 billion parameters across 96 layers, with a hidden size of 12,288 and 96 attention heads. The layers alternate between dense and locally banded sparse attention patterns.
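These figures are mutually consistent: the common per-block estimate of 12·d_model² weights (attention plus the 4×-expanded MLP) times 96 layers, plus the embedding matrices, lands close to 175 billion. The sketch below works through that arithmetic; the vocabulary size (50,257, GPT-2's BPE vocabulary) and the 2,048-token context window are assumptions not stated in this passage.

```python
# Rough parameter-count sanity check for the GPT-3 configuration described above.
# Assumes the GPT-2 BPE vocabulary (50,257 tokens) and a 2,048-token context
# window; the 12 * L * d^2 term covers attention (4*d^2) plus the 4x-expanded
# MLP (8*d^2) in each of the L transformer blocks.

n_layers = 96          # transformer blocks
d_model = 12_288       # hidden size
vocab_size = 50_257    # assumed GPT-2 BPE vocabulary
n_ctx = 2_048          # assumed context window (learned position embeddings)

block_params = 12 * n_layers * d_model ** 2        # attention + MLP weights
embedding_params = (vocab_size + n_ctx) * d_model  # token + position embeddings
total = block_params + embedding_params

print(f"~{total / 1e9:.1f}B parameters")           # prints ~174.6B, i.e. "175 billion"
```

The estimate deliberately ignores smaller terms such as biases and layer-norm parameters, which contribute only a tiny fraction of the total at this scale.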