Capsule Networks

    From Marovi AI
    Topic area Deep Learning
    Prerequisites Convolutional Neural Network, Backpropagation


    Overview

    A capsule network (CapsNet) is a neural network architecture in which groups of neurons, called capsules, collectively encode the instantiation parameters of a visual entity rather than emitting a single scalar activation. Each capsule outputs a vector (or matrix) whose length expresses the probability that an entity is present and whose orientation expresses properties such as pose, deformation, color, or texture. Capsules are connected across layers by a dynamic routing mechanism that replaces the scalar weighting of conventional convolutional networks, allowing higher-level capsules to selectively listen to lower-level capsules whose predictions agree.

    Capsule networks were introduced by Sabour, Frosst, and Hinton in 2017 as a response to perceived limitations of convolutional architectures: their reliance on max pooling for translation invariance, which discards precise spatial relationships, and their tendency to confuse rotated, mirrored, or otherwise transformed objects. By forcing each capsule to model an explicit pose, a capsule network is intended to be equivariant to viewpoint changes rather than merely invariant, recovering some of the structure that pooling layers throw away.

    Although capsule networks remain a niche architecture in production systems, they have influenced research on part-whole hierarchies, equivariant representations, and routing-based attention. Subsequent variants — EM routing, self-routing, stacked capsule autoencoders, and GLOM — explore alternative routing rules and training objectives.

    Motivation and Intuition

    A standard convolutional layer detects features by computing scalar activations at every spatial location, then aggregates them via pooling. Pooling discards the precise location of a feature in exchange for translation tolerance. Hinton has long argued that this is wasteful: the brain almost certainly retains pose information and relies on the agreement of part poses to recognize wholes, an idea sometimes summarized as "equivariance, not invariance".

    A capsule attempts to encode an entity together with its pose. If a face is a whole and the eyes, nose, and mouth are parts, then each part capsule can predict where the face capsule should be — for example, the position of the eyes, given typical facial geometry, predicts the position of the face. When several part capsules predict the same face pose, the network has strong evidence that a face is present. Disagreement among part predictions, by contrast, indicates either a misdetection or a non-face arrangement of features. This coincidence detection is the operative principle behind capsule routing.

    Capsule Representation

    A primary capsule typically arises from a convolution whose output channels are reshaped into vectors. For a layer of capsules indexed by $ i $, each capsule produces an output vector $ u_i \in \mathbb{R}^{d_i} $. The length $ \|u_i\| $ is constrained to lie in $ [0, 1) $ by a nonlinearity called the squash function:

    $ {\displaystyle v = \frac{\|s\|^2}{1 + \|s\|^2}\, \frac{s}{\|s\|}} $

    where $ s $ is the pre-activation input vector and $ v $ is the squashed output. Short vectors are pushed toward zero; long vectors saturate just below unit length. The norm $ \|v\| $ is interpreted as the probability that the entity modeled by the capsule is present, while the direction $ v / \|v\| $ encodes its instantiation parameters.
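    As a concrete check of the squash nonlinearity, the following NumPy sketch (an illustration, not the original implementation) shows how vector length maps onto a presence probability while direction is preserved:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    Short vectors shrink toward zero; long vectors saturate
    just below unit length, so ||v|| acts as a probability."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

v = squash(np.array([3.0, 4.0]))                     # input length 5
print(np.linalg.norm(v))                             # 25/26 ≈ 0.9615: strong presence
print(np.linalg.norm(squash(np.array([0.1, 0.0]))))  # ≈ 0.0099: near-zero presence
```

    Note that the direction of the output equals the direction of the input; only the length is rescaled into $ [0, 1) $.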

    Higher-level capsules consume predictions from lower-level capsules. Each lower-level capsule $ i $ contributes a prediction vector $ \hat{u}_{j|i} = W_{ij} u_i $, where $ W_{ij} $ is a learned transformation matrix that converts capsule $ i $'s pose into a prediction of capsule $ j $'s pose. The total input to capsule $ j $ is a weighted sum of these predictions, with weights determined by routing.
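    The prediction step can be sketched in NumPy as a batched matrix-vector product; the capsule counts and pose dimensions below are arbitrary placeholders chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
num_in, num_out = 8, 3      # hypothetical capsule counts per layer
d_in, d_out = 4, 6          # hypothetical pose dimensions

W = rng.normal(size=(num_in, num_out, d_out, d_in))  # learned matrices W_ij
u = rng.normal(size=(num_in, d_in))                  # lower-level outputs u_i

# Prediction vectors u_hat[i, j] = W_ij @ u_i, shape (num_in, num_out, d_out)
u_hat = np.einsum('ijod,id->ijo', W, u)
```

    Every (lower capsule, higher capsule) pair has its own matrix $ W_{ij} $, which is where the memory cost discussed later originates.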

    Dynamic Routing by Agreement

    Dynamic routing iteratively refines the assignment weights $ c_{ij} $ connecting capsule $ i $ in one layer to capsule $ j $ in the next. Initially, log-prior logits $ b_{ij} $ are set to zero. The routing procedure repeats for a small number of iterations (typically 3):

    1. Compute coupling coefficients by softmax over the destination capsules: $ c_{ij} = \mathrm{softmax}_j(b_{ij}) $.
    2. Compute each output capsule's pre-activation: $ s_j = \sum_i c_{ij}\,\hat{u}_{j|i} $.
    3. Squash to obtain $ v_j = \mathrm{squash}(s_j) $.
    4. Update logits by agreement: $ b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j $.

    The dot product $ \hat{u}_{j|i} \cdot v_j $ is large when capsule $ i $'s prediction aligns with the consensus output of capsule $ j $, increasing the coupling and reinforcing agreement. Capsules whose predictions disagree see their coupling decay. Routing happens at inference time as well as during training: the coupling coefficients are recomputed from scratch for every input rather than learned, while the transformation matrices $ W_{ij} $ (together with the earlier convolutional weights) are the parameters trained by gradient descent.
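    The four routing steps above can be sketched as follows. This is a minimal NumPy illustration assuming the prediction tensor $ \hat{u}_{j|i} $ has already been computed, not a faithful reproduction of any particular implementation:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors, shape (num_in, num_out, d_out),
    with u_hat[i, j] = W_ij @ u_i. Returns output capsules (num_out, d_out)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                   # log-prior logits b_ij, start at 0
    for _ in range(num_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))  # numerically stable softmax
        c = e / e.sum(axis=1, keepdims=True)          # 1. coupling c_ij = softmax_j(b_ij)
        s = np.einsum('ij,ijd->jd', c, u_hat)         # 2. s_j = sum_i c_ij * u_hat_j|i
        v = squash(s)                                 # 3. v_j = squash(s_j)
        b += np.einsum('ijd,jd->ij', u_hat, v)        # 4. b_ij += u_hat_j|i . v_j
    return v

v = dynamic_routing(np.random.default_rng(1).normal(size=(8, 3, 6)))
print(np.linalg.norm(v, axis=-1))   # each output length lies in [0, 1)
```

    The loop body is pure tensor algebra, but the sequential iterations are what make routing hard to parallelize relative to a feed-forward layer.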

    Loss Function

    The original CapsNet uses margin loss, a per-class objective that encourages the corresponding output capsule to have a long vector when the class is present and a short vector when absent:

    $ {\displaystyle L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k)\, \max(0, \|v_k\| - m^-)^2} $

    where $ T_k = 1 $ if class $ k $ is present, $ m^+ = 0.9 $, $ m^- = 0.1 $, and $ \lambda = 0.5 $ down-weights the absent-class term to prevent early shrinkage. The total loss is summed over classes. A reconstruction decoder, fed the active capsule's vector, optionally adds a small mean-squared reconstruction term to encourage the capsule to retain enough information to redraw the input.
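    A minimal sketch of the margin loss, using the constants quoted above ($ m^+ = 0.9 $, $ m^- = 0.1 $, $ \lambda = 0.5 $):

```python
import numpy as np

def margin_loss(v_norms, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_norms: output capsule lengths ||v_k||, shape (num_classes,).
    targets: indicator T_k (1 if class k present), same shape."""
    present = targets * np.maximum(0.0, m_pos - v_norms) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norms - m_neg) ** 2
    return float(np.sum(present + absent))

# Class 0 present with a long vector; classes 1 and 2 absent:
loss = margin_loss(np.array([0.95, 0.05, 0.2]), np.array([1.0, 0.0, 0.0]))
print(loss)   # 0.005: only class 2, whose length 0.2 exceeds m^-, is penalized
```

    Class 0 incurs no loss because its length exceeds $ m^+ $, and class 1 incurs none because its length is below $ m^- $; the hinge structure means only violations of the margins contribute gradient.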

    Variants

    Several routing rules and architectural extensions have been proposed:

    • EM routing (Hinton et al., 2018) replaces the dot-product agreement with an expectation-maximization procedure. Pose is represented by a 4×4 matrix and an activation logit; agreement is fit as a Gaussian mixture, which sharpens the routing decision when parts strongly cluster.
    • Self-routing capsules learn a routing function directly via a small subnetwork instead of iterating to convergence, trading interpretability for speed.
    • Stacked capsule autoencoders invert the typical setup: each capsule predicts the poses of its parts, and the network is trained unsupervised to reconstruct part templates.
    • GLOM (Hinton, 2021) generalizes the part-whole idea further, replacing discrete capsules with islands of agreement in a transformer-like representation.
    • Convolutional capsules apply the same transformation matrices across spatial locations, restoring the parameter-sharing efficiency of convolution.

    Comparison with Convolutional Networks

    Capsule networks differ from standard CNNs along three axes. First, they replace scalar features with vector poses, so each unit carries structural information that pooling would otherwise erase. Second, they substitute routing-by-agreement for pooling, which is a learned, input-dependent assignment rather than a fixed reduction. Third, they aim for equivariance: when the input is transformed, the capsule outputs change in a structured way rather than remaining identical. CNNs achieve invariance through data augmentation and pooling; capsule networks try to bake equivariance into the architecture.

    In practice, on small datasets such as MNIST and smallNORB, CapsNets match or slightly outperform comparable CNNs and recover novel viewpoints more gracefully. On larger datasets such as ImageNet, however, the original architecture has not closed the gap with deep residual networks or vision transformers, in part because dynamic routing is hard to scale and parameterize efficiently.

    Limitations and Open Problems

    Capsule networks face several practical obstacles. Routing is iterative and not easily parallelized, so per-batch throughput suffers compared with feed-forward layers of similar parameter count. Memory cost grows with the product of capsule counts across adjacent layers because every pair has its own transformation matrix. Training stability is sensitive to the number of routing iterations and to the squash nonlinearity, which is bounded above and can flatten gradients. Finally, the architectural inductive bias toward part-whole hierarchies has so far yielded its clearest benefits on small, structured datasets; whether equivariance can be exploited at the scale of modern foundation models remains an open research question.

    Despite these limitations, the conceptual contribution — that representations should encode pose explicitly and that agreement between predictions is a powerful learning signal — continues to inform work on equivariant networks, graph neural networks, and structured attention.

    References

    • Sabour, S.; Frosst, N.; Hinton, G. E. (2017). "Dynamic Routing Between Capsules". Advances in Neural Information Processing Systems 30.
    • Hinton, G. E.; Sabour, S.; Frosst, N. (2018). "Matrix Capsules with EM Routing". International Conference on Learning Representations.
    • Kosiorek, A. R.; Sabour, S.; Teh, Y. W.; Hinton, G. E. (2019). "Stacked Capsule Autoencoders". Advances in Neural Information Processing Systems 32.
    • Hinton, G. E. (2021). "How to Represent Part-Whole Hierarchies in a Neural Network". arXiv:2102.12627.