Knowledge Distillation
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Cross-Entropy Loss, Softmax Function, KL Divergence |
Overview
Knowledge distillation is a model-compression and knowledge-transfer technique in which a small "student" network is trained to imitate the behaviour of a larger, more accurate "teacher" model rather than learning directly from raw labels. The student's training objective combines, or replaces, the standard supervised loss with a term that pulls the student's output distribution toward the teacher's output distribution on the same inputs. Because the teacher's outputs encode richer information than one-hot labels, including how confident the teacher is and which alternative classes it considers plausible, the student can often reach an accuracy that is unobtainable when training from scratch on the labels alone, while running at a fraction of the compute and memory cost.
The technique was popularised in modern deep learning by Hinton, Vinyals, and Dean in 2015, who framed it as transferring the "dark knowledge" embedded in a teacher's softened logits. It has since become a standard tool in production deep-learning pipelines, deployed wherever a powerful but expensive model needs to be replaced by a cheaper one for inference: mobile vision models distilled from large convolutional ensembles, small language models distilled from frontier Transformer-based teachers, and on-device speech recognisers distilled from server-grade systems. Beyond compression, distillation is used inside training pipelines for self-improvement, ensemble compression, transfer between architectures, and as a regulariser even when the teacher and student are the same size.
Formulation
The canonical formulation considers a classification task with $ K $ classes. Let $ z^t = f^t(x) $ and $ z^s = f^s(x) $ denote the logits produced by the teacher and the student on input $ x $. Hinton's key device is the temperature-scaled Softmax Function:
$ {\displaystyle p_i^{\tau}(z) = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{K} \exp(z_j / \tau)}.} $
A temperature $ \tau > 1 $ softens the distribution, raising the relative probability of non-top classes and exposing the teacher's relative beliefs about them. The distillation loss matches the student's softened distribution to the teacher's:
$ {\displaystyle \mathcal{L}_{\text{KD}}(x) = \tau^2 \, D_{\mathrm{KL}}\!\left(p^{\tau}(z^t) \,\|\, p^{\tau}(z^s)\right),} $
where the $ \tau^2 $ factor compensates for the gradient scaling introduced by dividing logits by $ \tau $, so that the magnitude of the distillation gradient remains comparable across temperatures. The total objective is typically a convex combination with the standard hard-label Cross-Entropy Loss:
$ {\displaystyle \mathcal{L}(x, y) = (1 - \alpha) \, \mathcal{L}_{\text{CE}}(y, p^{1}(z^s)) + \alpha \, \mathcal{L}_{\text{KD}}(x),} $
where $ y $ is the ground-truth label, $ \alpha \in [0, 1] $ trades off the two terms, and the cross-entropy is evaluated at temperature $ 1 $ so that hard-label supervision is not softened. Typical hyperparameters are $ \tau \in [2, 10] $ and $ \alpha \in [0.5, 0.9] $, with values tuned on a validation set.
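The combined objective is straightforward to implement. A minimal PyTorch sketch is shown below; the function name and hyperparameter defaults are illustrative, not canonical.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.7):
    """Hinton-style distillation objective (sketch; defaults are illustrative).

    student_logits, teacher_logits: tensors of shape (batch, K).
    labels: integer class labels of shape (batch,).
    """
    # Soft-target term: KL(teacher || student) at temperature tau, scaled by
    # tau**2 so that gradient magnitudes stay comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2

    # Hard-label cross-entropy evaluated at temperature 1.
    ce = F.cross_entropy(student_logits, labels)

    return (1.0 - alpha) * ce + alpha * kd
```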
In the high-temperature limit, expanding the softened softmax shows that minimising the KL Divergence reduces to matching the teacher's logits up to a per-example mean, recovering the older logit-matching approach from the model-compression line begun by Bucila, Caruana, and Niculescu-Mizil. At temperature $ 1 $ the distillation term reduces to ordinary cross-entropy against the teacher's predictive distribution, i.e. "soft-label training".
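This connection can be made concrete by differentiating the distillation term with respect to a single student logit (following Hinton, Vinyals, and Dean's expansion, and assuming each example's logits are shifted to zero mean):
$ {\displaystyle \frac{\partial \mathcal{L}_{\text{KD}}}{\partial z_i^s} = \tau \left( p_i^{\tau}(z^s) - p_i^{\tau}(z^t) \right) \;\approx\; \frac{1}{K} \left( z_i^s - z_i^t \right) \quad \text{for large } \tau,} $
and the right-hand side is exactly the gradient of the squared logit difference $ \tfrac{1}{2K} \lVert z^s - z^t \rVert_2^2 $.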
Why it works: dark knowledge
The intuition Hinton emphasised is that a confident teacher's near-zero probabilities on incorrect classes still carry information. A model trained on ImageNet may assign a probability of $ 10^{-6} $ to "BMW" and $ 10^{-9} $ to "carrot" when the true label is "garbage truck", and the ratio of these tiny probabilities encodes that BMWs are more truck-like than carrots are. One-hot labels destroy this similarity structure; the teacher's softened distribution preserves it. Training the student to reproduce the full distribution therefore conveys an inductive bias about the geometry of the label space that no labelled example can supply on its own.
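The effect of temperature on these tiny probabilities can be seen directly. A small demonstration follows; the logits are invented for the three classes mentioned above.

```python
import torch
import torch.nn.functional as F

# Invented teacher logits for "garbage truck", "BMW", "carrot".
logits = torch.tensor([12.0, 3.0, -4.0])

for tau in (1.0, 4.0, 10.0):
    print(tau, F.softmax(logits / tau, dim=-1).tolist())

# At tau = 1 the two wrong classes are numerically negligible; at higher
# temperatures the target distribution exposes the "BMW is more truck-like
# than carrot" ordering that the student is asked to reproduce.
```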
A complementary view is that the teacher acts as a smoothed estimator of the Bayes-optimal class posterior. Where labels are stochastic or ambiguous, the teacher's distribution averages over plausible answers, giving the student a less noisy training signal than the labels themselves. From this angle distillation is a form of Regularization closely related to Label Smoothing: both replace one-hot targets with softer targets, but distillation's targets are input-dependent rather than uniform. This regularising effect has been formalised in statistical analyses showing that, when the teacher is a well-calibrated estimate of the class posterior, training on its soft targets lowers the variance of the student's empirical risk at the cost of a bias toward the teacher, a Bias-Variance Tradeoff.
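The contrast with Label Smoothing can be stated concretely: both construct soft targets, but only distillation's depend on the input. A small sketch, with invented numbers:

```python
import torch
import torch.nn.functional as F

K, eps, tau, y = 5, 0.1, 4.0, 2    # classes, smoothing, temperature, true label

# Label smoothing: the same input-independent target for every example of class y.
smoothed = torch.full((K,), eps / K)
smoothed[y] += 1.0 - eps

# Distillation: an input-dependent target read off the teacher's softened output.
teacher_logits = torch.tensor([1.0, 3.5, 6.0, 2.0, -1.0])   # illustrative
soft_target = F.softmax(teacher_logits / tau, dim=-1)
```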
Variants
The Hinton-style soft-target loss is now usually called response distillation or logit distillation because the supervision lives at the network output. A second family, feature distillation, instead matches intermediate representations: the student is asked to reproduce the teacher's hidden activations or attention maps, possibly through a learned projection. FitNets, attention transfer, and more recent feature-mimicry losses all fall in this category. Feature distillation can extract more guidance from the teacher when output supervision alone is insufficient, particularly when the architectures differ enough that aligning outputs is too coarse a constraint.
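A minimal sketch of a feature-matching loss in the FitNets style, assuming convolutional feature maps and illustrative channel counts:

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistiller(nn.Module):
    """FitNets-style hint loss: match a student feature map to a teacher
    feature map through a learned projection (dimensions are illustrative)."""

    def __init__(self, student_channels=64, teacher_channels=256):
        super().__init__()
        # 1x1 convolution projecting student features into the teacher's space.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # student_feat: (B, Cs, H, W); teacher_feat: (B, Ct, H, W), teacher frozen.
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())
```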
A third family, relation distillation, transfers structural information about how the teacher organises a batch of examples rather than its absolute predictions. Methods such as Relational KD and Similarity-Preserving KD match Gram matrices of activations or pairwise distances between embeddings, which makes the supervision invariant to the precise feature dimensions of the two networks.
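A sketch of a similarity-preserving loss of this kind: only the $ B \times B $ batch-similarity (Gram) matrices are compared, so the two networks' feature dimensions never need to agree.

```python
import torch.nn.functional as F

def similarity_preserving_loss(student_feat, teacher_feat):
    """Match row-normalised batch Gram matrices (a sketch of Similarity-Preserving KD)."""
    b = student_feat.size(0)
    fs = student_feat.reshape(b, -1)           # (B, Ds)
    ft = teacher_feat.reshape(b, -1)           # (B, Dt) -- Ds and Dt may differ
    gs = F.normalize(fs @ fs.t(), p=2, dim=1)  # (B, B) row-normalised Gram matrix
    gt = F.normalize(ft @ ft.t(), p=2, dim=1)
    return F.mse_loss(gs, gt.detach())
```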
Distillation is also categorised by when the teacher and student are trained. Offline distillation uses a fixed pretrained teacher; this is by far the most common setting. Online distillation trains a cohort of students together, with each student treating an aggregate of the others as a soft teacher, eliminating the need for a separately trained teacher. Self-distillation iterates a single architecture, with the student of one round becoming the teacher of the next, and surprisingly often improves accuracy even though the architecture is held fixed. Born-again networks formalise this iterated self-distillation procedure.
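The born-again procedure can be summarised as a loop. The helpers make_model, train_with_labels, and distill below are hypothetical stand-ins for a full training script (distill would run soft-target training against a frozen teacher, as in the pipeline sketched later):

```python
def born_again(num_generations, loader):
    # Hypothetical helpers: make_model() builds the fixed architecture,
    # train_with_labels() trains on hard labels only, distill() runs
    # soft-target training against a frozen teacher.
    teacher = train_with_labels(make_model(), loader)    # generation 0
    for _ in range(num_generations):
        student = distill(make_model(), teacher, loader)  # same architecture each round
        teacher = student                                 # this student teaches the next round
    return teacher
```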
For language models specifically, sequence-level distillation by Kim and Rush adapts the technique to autoregressive generation by training the student to imitate the teacher's beam-search outputs rather than its per-token distribution, which avoids exposure-bias mismatches and is widely used to compress translation and summarisation models. For very large models, distillation underlies many practical small-LM recipes, including DistilBERT, MobileBERT, and the broader practice of producing inference-cheap variants of frontier teachers.
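A sketch of the pseudo-target construction, assuming a Hugging Face-style sequence-to-sequence teacher and tokenizer (the variable names are placeholders):

```python
import torch

@torch.no_grad()
def build_pseudo_targets(teacher, tokenizer, sources, num_beams=5, max_length=128):
    """Decode each source with the frozen teacher's beam search; the returned
    strings replace the gold references in the student's ordinary training."""
    inputs = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
    outputs = teacher.generate(**inputs, num_beams=num_beams, max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```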
Training and inference
A standard offline distillation pipeline runs as follows. The teacher is trained or downloaded and held frozen. The training loop iterates over the labelled training set and, for each batch, runs both the teacher (in evaluation mode) and the student. The teacher's softened probabilities are precomputed if storage permits, or computed on the fly otherwise; storing teacher logits avoids redundant teacher forward passes across epochs but consumes $ O(N K) $ additional memory for $ N $ training examples. The student is updated by backpropagating the combined loss through its own parameters; the teacher is never updated.
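Putting these pieces together, one epoch of the offline pipeline might look as follows, reusing the distillation_loss sketch above and computing teacher logits on the fly:

```python
import torch

def distill_epoch(student, teacher, loader, optimizer, tau=4.0, alpha=0.7, device="cpu"):
    teacher.eval()      # frozen teacher, evaluation mode
    student.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():                  # the teacher is never updated
            teacher_logits = teacher(x)
        student_logits = student(x)
        loss = distillation_loss(student_logits, teacher_logits, y, tau, alpha)
        optimizer.zero_grad()
        loss.backward()                        # gradients flow only into the student
        optimizer.step()
```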
Distillation can be run on the same data the teacher saw, on additional unlabelled data (since soft labels do not require ground truth), or on a held-out transfer set. The unlabelled-data setting is particularly attractive in production: one can scale the student's training set far beyond the labelled corpus by relying on the teacher to provide targets, which is essentially how modern small language models are produced from frontier teachers.
At inference time the teacher is discarded entirely. The student runs as a standalone model with no architectural overhead from the distillation procedure.
Comparisons
Distillation is one of three principal model-compression strategies, alongside Quantization and Pruning. Quantisation reduces the numerical precision of a fixed architecture; pruning removes weights or structures from a fixed architecture; distillation changes the architecture entirely, often replacing a deep wide network with a shallower or narrower one. The three are largely complementary and routinely combined: a frontier teacher is distilled into a smaller architecture, which is then pruned and quantised for deployment. Distillation alone tends to give the largest accuracy-at-fixed-size gains when the original teacher is much larger than what the deployment budget allows, while quantisation and pruning give better gains when the architecture is already close to the right size.
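For example, a distilled student such as the one produced by the training sketch above can be compressed further with post-training dynamic quantisation in PyTorch (here `student` stands for that trained model):

```python
import torch
import torch.nn as nn

# Quantise the distilled student's linear layers to int8 for deployment.
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)
```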
Distillation also has close conceptual cousins outside compression. Co-distillation and online distillation are forms of Ensemble Methods training in which multiple students teach each other; mean-teacher methods in Semi-Supervised Learning are a moving-average form of self-distillation; and policy distillation in reinforcement learning transfers a complex policy into a simpler one using the same machinery applied to action distributions instead of class probabilities.
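The mean-teacher connection is visible in its update rule, sketched below: the teacher's weights are an exponential moving average of the student's.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Mean-teacher update: teacher <- m * teacher + (1 - m) * student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```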
Limitations
Distillation is not free. It requires a working teacher, which must itself have been trained at some cost, and the student's accuracy is upper-bounded by what the chosen student architecture can in principle represent: a network that lacks the capacity to model the task will not be saved by softer targets. The choice of temperature and loss weight is empirical, and pathological combinations, such as a very high temperature paired with $ \alpha $ close to $ 1 $, can produce a student that imitates the teacher's mistakes more faithfully than its successes. When the teacher is poorly calibrated, soft labels can actively harm the student, and distilling from a teacher that has memorised its training set propagates that memorisation into the student.
Feature-level distillation introduces additional brittleness: the projection alignment between teacher and student features is itself a hyperparameter, and aggressive feature matching can over-constrain the student to the teacher's representational idiosyncrasies. For generative models, distillation interacts with the Exposure Bias of autoregressive training in ways that the response-level Hinton loss cannot address, motivating the sequence-level variants. Finally, distillation provides no guarantees about behaviour outside the distribution covered by the transfer set; a distilled student can quietly fail in regions of input space the teacher was never queried on, which is a particular concern for safety-critical deployments and for distilling Large Language Models whose teachers are queried over a vast input space at training time but only a narrow slice at distillation time.
References
- Hinton, G., Vinyals, O., and Dean, J. Distilling the Knowledge in a Neural Network. arXiv:1503.02531, 2015.
- Bucila, C., Caruana, R., and Niculescu-Mizil, A. Model Compression. KDD, 2006.
- Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. FitNets: Hints for Thin Deep Nets. ICLR, 2015.
- Zagoruyko, S. and Komodakis, N. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. ICLR, 2017.
- Park, W., Kim, D., Lu, Y., and Cho, M. Relational Knowledge Distillation. CVPR, 2019.
- Tung, F. and Mori, G. Similarity-Preserving Knowledge Distillation. ICCV, 2019.
- Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. Born-Again Neural Networks. ICML, 2018.
- Kim, Y. and Rush, A. M. Sequence-Level Knowledge Distillation. EMNLP, 2016.
- Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108, 2019.