Instance Segmentation
| Article | |
|---|---|
| Topic area | Computer Vision |
| Prerequisites | Convolutional Neural Network, Object Detection, Semantic Segmentation |
Overview
Instance segmentation is the computer vision task of detecting every object instance in an image and producing a pixel-precise mask for each one. Unlike Semantic Segmentation, which assigns a single class label to every pixel without distinguishing between different objects of the same class, instance segmentation separates each individual object: two adjacent cars receive two distinct masks rather than one merged "car" region. Unlike Object Detection, which localizes objects with axis-aligned bounding boxes, instance segmentation delineates exact object boundaries.
The output for an image is a variable-length set of tuples, each consisting of a class label, a confidence score, and a binary mask covering the object. Instance segmentation underpins applications in autonomous driving, medical imaging, robotics manipulation, satellite imagery analysis, and photo editing tools. It is harder than either of its component tasks because the model must simultaneously solve a detection problem (how many objects, of what classes, where) and a fine-grained pixel-labeling problem (which pixels belong to which instance), and must do so in a way that resolves overlapping and occluded objects.
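The output structure described above can be sketched as follows; the class and shapes are illustrative, not any particular library's API:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class InstancePrediction:
    label: str        # class name, e.g. "car"
    score: float      # detection confidence in [0, 1]
    mask: np.ndarray  # boolean array of shape (H, W), True on object pixels


# A model's output for one image is a variable-length list of such tuples.
# Here, two adjacent cars get two distinct masks rather than one merged region.
H, W = 4, 6
mask_a = np.zeros((H, W), dtype=bool)
mask_a[1:3, 0:2] = True
mask_b = np.zeros((H, W), dtype=bool)
mask_b[1:3, 3:5] = True

predictions = [
    InstancePrediction("car", 0.97, mask_a),
    InstancePrediction("car", 0.91, mask_b),  # a second, separate car instance
]
```

Note that unlike semantic segmentation, the two masks carry the same class label but remain separate objects, and unlike a bounding box, each mask covers only the object's own pixels.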
Relationship to Adjacent Tasks
Instance segmentation sits at the intersection of several well-studied problems. Semantic segmentation produces a single label map and cannot count objects. Object detection produces axis-aligned boxes that include background pixels around the object. Panoptic segmentation is a strict generalization that unifies instance segmentation for "things" (countable objects like people and cars) with semantic segmentation for "stuff" (amorphous regions like sky and road), assigning every pixel to exactly one segment.
A useful way to think about the difference: semantic segmentation answers "what is at each pixel," detection answers "where is each object," and instance segmentation answers both at once. Many architectures exploit this by sharing a backbone across tasks and attaching task-specific heads.
Two-Stage Approaches
The dominant paradigm for several years was the two-stage detector extended with a mask head. Mask R-CNN[1] extends Faster R-CNN by adding a small fully convolutional network that predicts a binary mask for each region of interest (RoI). The two stages are: (1) a Region Proposal Network that generates candidate object boxes from the backbone feature map, and (2) per-RoI heads that classify the proposal, refine the box, and predict a fixed-resolution mask (typically 28×28 or 14×14) which is then resized to the box.
A key contribution was RoIAlign, which replaces the quantized RoIPool with bilinear sampling at exact floating-point coordinates. Quantization introduced misalignments of a few pixels that were tolerable for box regression but devastating for masks, where one or two pixels of slop visibly degrade boundaries. RoIAlign improved mask AP by several points without changing the rest of the model.
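The core operation behind RoIAlign is bilinear interpolation at exact floating-point coordinates, in place of rounding to the nearest integer cell. A minimal NumPy sketch of sampling one point (the real operator samples a regular grid of such points per RoI bin and averages them):

```python
import numpy as np


def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at a floating-point (y, x) location by
    blending the four surrounding cells, instead of quantizing to one cell."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0  # fractional offsets act as blend weights
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])


feat = np.arange(16, dtype=float).reshape(4, 4)
# Halfway between four cells: equal-weight average of 5, 6, 9, 10.
val = bilinear_sample(feat, 1.5, 1.5)  # 7.5
```

Because the result varies smoothly with (y, x), small shifts in the proposal box produce small shifts in the pooled features, which is exactly the sub-pixel fidelity masks need.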
The mask head decouples mask prediction from classification: it predicts a separate binary mask per class and selects the one corresponding to the predicted class at inference time. The per-RoI mask loss is a binary cross-entropy computed only on the channel for the ground-truth class, which avoids competition between classes.
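This per-class decoupling can be sketched in NumPy; the function name and shapes are illustrative, assuming 80 classes and a 28×28 mask head as in the COCO setup:

```python
import numpy as np


def per_roi_mask_loss(mask_logits, gt_class, gt_mask):
    """Binary cross-entropy on the mask channel of the ground-truth class only.
    A per-pixel sigmoid is used (not a softmax across classes), so the other
    class channels receive no gradient and classes do not compete."""
    logits = mask_logits[gt_class]            # (M, M) channel for the gt class
    p = 1.0 / (1.0 + np.exp(-logits))         # per-pixel sigmoid
    eps = 1e-7                                # numerical-stability guard
    return -np.mean(gt_mask * np.log(p + eps)
                    + (1 - gt_mask) * np.log(1 - p + eps))


rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(80, 28, 28))   # one 28x28 mask per class
gt_mask = (rng.random((28, 28)) > 0.5).astype(float)
loss = per_roi_mask_loss(mask_logits, gt_class=3, gt_mask=gt_mask)
```

At inference the same selection happens in reverse: the box head predicts the class, and only that class's mask channel is kept.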
One-Stage and Real-Time Approaches
Two-stage methods are accurate but comparatively slow because the per-RoI heads must be evaluated once for every proposal, so inference cost grows with the number of detections. One-stage instance segmentation aims for real-time inference by predicting masks directly from a dense feature map.
YOLACT[2] factors mask prediction into two parts: a set of image-wide "prototype masks" produced once per image, and per-instance coefficients predicted at each detection. The final instance mask is a linear combination of prototypes, then cropped by the predicted box. SOLO[3] reformulates the problem entirely: it predicts an instance category at each grid cell and a corresponding mask, replacing the detect-then-segment pipeline with direct dense mask prediction. CondInst[4] predicts per-instance dynamic convolutional filters that are applied to a shared feature map to produce masks.
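YOLACT's assembly step is just a weighted sum followed by a sigmoid and a crop. A minimal sketch, with illustrative shapes (the paper uses k = 32 prototypes at roughly 1/4 image resolution):

```python
import numpy as np


def assemble_mask(prototypes, coeffs, box):
    """YOLACT-style mask assembly: a linear combination of k prototype maps
    with one coefficient vector per detected instance, passed through a
    sigmoid and then cropped to the predicted box."""
    m = np.tensordot(coeffs, prototypes, axes=1)   # (k,) x (k, H, W) -> (H, W)
    m = 1.0 / (1.0 + np.exp(-m))                   # sigmoid
    x0, y0, x1, y1 = box
    cropped = np.zeros_like(m)
    cropped[y0:y1, x0:x1] = m[y0:y1, x0:x1]        # zero everything outside
    return cropped


k, H, W = 8, 32, 32
prototypes = np.random.default_rng(1).normal(size=(k, H, W))
coeffs = np.random.default_rng(2).normal(size=k)   # one vector per detection
mask = assemble_mask(prototypes, coeffs, box=(4, 4, 20, 20))
```

The expensive part (the prototypes) is computed once per image, while each instance adds only a k-dimensional coefficient vector, which is what makes the method fast.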
Transformer-Based Approaches
Detection transformers reframe instance prediction as set prediction. DETR[5] uses learned object queries that attend to image features and emit a fixed-size set of (class, box) predictions, trained with a Hungarian matching loss that enforces one-to-one correspondence with ground truth, eliminating the need for non-maximum suppression.
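The one-to-one matching can be illustrated with a brute-force version; real implementations use an efficient Hungarian solver such as SciPy's `linear_sum_assignment`, and the cost entries combine classification and box terms, which this sketch abstracts into a single matrix:

```python
import itertools

import numpy as np


def match(cost):
    """Exhaustively find the one-to-one assignment of queries to ground-truth
    objects that minimizes total cost. Exponential in n; for illustration only."""
    n = cost.shape[0]
    best = min(itertools.permutations(range(n)),
               key=lambda perm: sum(cost[i, perm[i]] for i in range(n)))
    return list(best)


# Rows: object queries, columns: ground-truth objects, entries: matching cost.
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.9],
                 [0.8, 0.9, 0.1]])
assignment = match(cost)  # query i is matched to ground truth assignment[i]
```

Because each ground-truth object is claimed by exactly one query, duplicate predictions are penalized during training, which is why no non-maximum suppression is needed at inference.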
MaskFormer and Mask2Former[6] extend this idea to segmentation by having each query predict a class and a mask directly. A single Mask2Former architecture achieves state-of-the-art results on semantic, instance, and panoptic segmentation benchmarks, suggesting that the underlying problem is unified at the architectural level even when evaluated under different metrics. The key insight is that mask prediction is a natural form of attention: each query's mask is the attention map over image features.
Training Objectives
Instance segmentation models are typically trained with a multi-task loss combining classification, box regression, and mask prediction:
$ {\displaystyle \mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{mask}} \mathcal{L}_{\text{mask}}} $
The mask loss is most commonly per-pixel binary cross-entropy, sometimes augmented with a Dice loss for boundary quality:
$ {\displaystyle \mathcal{L}_{\text{Dice}} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}} $
where $ p_i $ is the predicted probability and $ g_i \in \{0,1\} $ the ground-truth label at pixel $ i $. Dice loss handles class imbalance better than cross-entropy when the foreground covers a small fraction of pixels, but is noisier on its own; combining the two is standard practice.
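The Dice loss above translates directly into code; a small epsilon is added here for numerical stability when both masks are empty, which the formula omits:

```python
import numpy as np


def dice_loss(p, g, eps=1e-7):
    """Soft Dice loss over flattened masks: 1 - 2*sum(p*g)/(sum(p)+sum(g)).
    p holds predicted probabilities, g holds binary ground-truth labels."""
    return 1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)


g = np.array([0.0, 1.0, 1.0, 0.0])
perfect = np.array([0.0, 1.0, 1.0, 0.0])
half = np.array([0.0, 1.0, 0.0, 0.0])  # found one of two foreground pixels

dice_loss(perfect, g)  # ~0.0
dice_loss(half, g)     # 1 - 2/3, about 0.333
```

Note that the loss depends only on foreground sums, so a mask that is 99% background contributes no flood of easy negatives, which is the class-imbalance advantage over plain cross-entropy.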
Evaluation
The standard metric is mask Average Precision (mask AP), computed analogously to box AP but with intersection over union computed between predicted and ground-truth masks rather than boxes. COCO mAP[7] averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05 and over categories, with separate breakdowns reported for small, medium, and large objects. Boundary-aware metrics such as Boundary AP have been proposed to better reward sharp mask edges, since pixel-IoU saturates well before the mask is visually clean.
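The mask IoU underlying this metric is a straightforward pixel-set overlap:

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union between two boolean masks: the fraction of
    pixels covered by either mask that are covered by both."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0


a = np.zeros((10, 10), dtype=bool)
a[2:6, 2:6] = True   # 16 pixels
b = np.zeros((10, 10), dtype=bool)
b[4:8, 4:8] = True   # 16 pixels, overlapping a in a 2x2 corner

iou = mask_iou(a, b)  # 4 / 28, about 0.143
```

A prediction counts as a true positive at threshold t only if its mask IoU with an unmatched ground-truth mask exceeds t, so this 0.143 pair would match at no standard COCO threshold.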
Common datasets include COCO (80 classes, ~118k training images), LVIS[8] (1200+ classes with long-tailed distribution), and Cityscapes (urban driving scenes with fine pixel annotations).
Limitations and Open Problems
Instance segmentation inherits the difficulties of detection (small objects, dense scenes, novel classes) and adds segmentation-specific failures: thin structures (wires, animal limbs) are systematically under-segmented because their pixel area is small relative to the loss; mask boundaries are typically a few pixels off from the true boundary because output resolutions are coarse; heavy occlusion confuses both the detector (missed instances) and the mask head (one mask covering two objects). Long-tailed class distributions in datasets like LVIS expose the gap between common-class accuracy and rare-class accuracy, and prompt research into class-balanced losses and sampling.
A separate frontier is open-vocabulary and promptable instance segmentation. The Segment Anything Model[9] demonstrated that a model trained on a billion-mask dataset can produce high-quality masks from point or box prompts on previously unseen categories, decoupling segmentation from a fixed label set. Combined with vision-language models for class assignment, this points toward instance segmentation systems that operate over open vocabularies rather than the closed taxonomies of COCO-era datasets.
References
- ↑ He et al., Mask R-CNN, ICCV 2017.
- ↑ Bolya et al., YOLACT: Real-time Instance Segmentation, ICCV 2019.
- ↑ Wang et al., SOLO: Segmenting Objects by Locations, ECCV 2020.
- ↑ Tian et al., Conditional Convolutions for Instance Segmentation, ECCV 2020.
- ↑ Carion et al., End-to-End Object Detection with Transformers, ECCV 2020.
- ↑ Cheng et al., Masked-attention Mask Transformer for Universal Image Segmentation, CVPR 2022.
- ↑ Lin et al., Microsoft COCO: Common Objects in Context, ECCV 2014.
- ↑ Gupta et al., LVIS: A Dataset for Large Vocabulary Instance Segmentation, CVPR 2019.
- ↑ Kirillov et al., Segment Anything, ICCV 2023.