What is image segmentation?

Image segmentation is a computer vision technique that partitions a digital image into discrete groups of pixels—image segments—to inform object detection and related tasks. By parsing an image’s complex visual data into specifically shaped segments, image segmentation enables faster, more advanced image processing.

Image segmentation techniques range from simple, intuitive heuristic analysis to cutting-edge deep learning implementations. Conventional image segmentation algorithms process high-level visual features of each pixel, like color or brightness, to identify object boundaries and background regions. Machine learning, leveraging annotated datasets, is used to train models to accurately classify the specific types of objects and regions an image contains.

Being a highly versatile and practical method of computer vision, image segmentation has a wide variety of artificial intelligence use cases, from aiding diagnosis in medical imaging to automating locomotion for robotics and self-driving cars to identifying objects of interest in satellite images.

Image segmentation vs. object detection vs. image classification

Image segmentation represents an advanced evolution of both image classification and object detection, and brings a distinct set of computer vision capabilities of its own.

Image classification applies a class label to an entire image. For example, a simple image classification model might be trained to categorize vehicle images as “car” or “truck”. Conventional image classification systems are limited in sophistication, as they do not process individual image features separately.

Object detection combines image classification with object localization, generating rectangular regions, called “bounding boxes”, in which objects are located: rather than merely labeling a vehicle image as “car” or “truck”, an object detection model could indicate where in the image the car(s) or truck(s) can be found. While object detection can classify multiple elements within an image and approximate each element’s width and height, it cannot discern precise boundaries or shapes. This limits the ability of conventional object detection models to delineate closely bunched objects with overlapping bounding boxes.

Image segmentation processes visual data at the pixel level, using various techniques to annotate individual pixels as belonging to a specific class or instance. “Classic” image segmentation techniques determine annotations by analyzing inherent qualities of each pixel (called “heuristics”) like color and intensity, while deep learning models employ complex neural networks for sophisticated pattern recognition. The outputs of this annotation are segmentation masks, representing the specific pixel-by-pixel boundary and shape of each class—typically corresponding to different objects, features or regions—in the image. 
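
In code, a segmentation mask is often just a two-dimensional array in which each element holds the class (or instance) label of the corresponding pixel. The following toy example—with purely hypothetical class names and values—illustrates the idea:

```python
import numpy as np

# Hypothetical class indices for a tiny 4x6 image; real models use far more classes.
CLASS_NAMES = {0: "background", 1: "road", 2: "car"}

# The segmentation mask: one class index per pixel.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 0, 0],
    [1, 2, 2, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
])

# Count the pixels assigned to each class.
for class_id, name in CLASS_NAMES.items():
    print(name, int((mask == class_id).sum()))
```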

Broadly speaking, image segmentation is used for three types of tasks: semantic segmentation, instance segmentation and panoptic segmentation.

Semantic classes: "things" and "stuff"

The difference between each type of image segmentation task lies in how they treat semantic classes: the specific categories a given pixel might be determined to belong to.

In the parlance of computer vision, there are two types of semantic classes. Each lends itself to different techniques for accurate, effective segmentation.

Things are classes of objects with characteristic shapes, like “car” or “tree” or “person”. Typically, things have clearly defined instances that are countable. They have relatively little variance in size from one instance to the next, as well as constituent parts distinct from the thing itself: for example, all cars have wheels, but a wheel is not a car.

Stuff refers to semantic classes that are amorphously shaped and highly variable in size, like “sky” or “water” or “grass”. Typically, stuff does not have clearly defined, countable individual instances. Unlike things, stuff does not have distinct parts: a blade of grass and field of grass are both equally “grass”.

Some classes, under certain image conditions, can be treated as either things or stuff. For example, a large group of people can be interpreted as multiple “persons”—each a distinctly shaped, countable thing—or as a singular, amorphously shaped “crowd”.

While most object detection efforts focus primarily on thing classes, it’s important to consider that stuff—sky, walls, floors, ground—makes up the majority of our visual context. Stuff is an essential data point for identifying things, and vice versa: a metal thing on a road is usually a car; the blue background behind a boat is probably water, while the blue background behind a plane is probably sky. This is of particular importance to deep learning models.

Semantic segmentation

Semantic segmentation is the simplest type of image segmentation. A semantic segmentation model assigns a semantic class to every pixel, but doesn’t output any other context or information (such as individual object instances).

Semantic segmentation treats all pixels as stuff; it does not differentiate between stuff and things.

For example, a semantic segmentation model trained to identify certain classes on a city street would yield segmentation masks indicating the boundaries and contours for each relevant class of thing (like vehicles or light poles) and stuff (like roads and sidewalks), but it would not distinguish between (or count) multiple instances of the same class. Cars parked one in front of the other might simply be treated as one long “car” segment.
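
As a rough sketch of what this looks like in practice, the snippet below runs a pretrained semantic segmentation model from torchvision and reduces its output to a per-pixel class map. The model choice, weights argument and image path are assumptions for illustration, and exact arguments vary by torchvision version.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import fcn_resnet50

# Load a pretrained FCN model (the weights argument name varies across torchvision versions).
model = fcn_resnet50(weights="DEFAULT").eval()

# Standard ImageNet-style preprocessing; "street.jpg" is a placeholder path.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = Image.open("street.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    output = model(batch)["out"]          # shape: (1, num_classes, H, W)
semantic_mask = output.argmax(dim=1)[0]   # per-pixel class index, shape (H, W)
```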

Instance segmentation

Instance segmentation inverts the priorities of semantic segmentation: whereas semantic segmentation algorithms predict only semantic classification of each pixel (with no regard for individual instances), instance segmentation delineates the exact shape of each separate object instance.

Instance segmentation isolates things from stuff—which it ignores—and can thus be understood as an evolved form of object detection that outputs a precise segmentation mask instead of an approximate bounding box.

It’s a more difficult task than semantic segmentation: even when things of the same class are touching or even overlapping one another, instance segmentation models must be able to separate and determine the shape of each one, whereas semantic segmentation models can simply lump them together. Consider, for example, how the two models would treat cars parked bumper to bumper on a city street: a semantic model returns one merged “car” region, while an instance model returns a separate mask for each car.

Instance segmentation algorithms generally take either a two-stage or one-shot approach to the problem. Two-stage models, like Region-based Convolutional Neural Networks (R-CNNs), perform conventional object detection to generate bounding boxes for each proposed instance, then perform more refined segmentation and classification within each bounding box. One-shot models, like YOLO (You Only Look Once), achieve real-time instance segmentation by performing object detection, classification and segmentation simultaneously.

One-shot approaches offer greater speed (with a tradeoff in accuracy), whereas two-stage approaches offer greater accuracy (with a tradeoff in speed).
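
As an illustrative sketch of the two-stage approach, torchvision ships a pretrained Mask R-CNN that returns a separate mask, bounding box, label and confidence score for every detected instance. The image path and score threshold below are placeholders:

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pretrained two-stage instance segmentation model (Mask R-CNN with a ResNet-50 FPN backbone).
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

# torchvision's Mask R-CNN expects a list of float tensors scaled to [0, 1].
image = transforms.ToTensor()(Image.open("street.jpg").convert("RGB"))

with torch.no_grad():
    predictions = model([image])[0]

# Each detected instance gets its own soft mask; threshold to obtain binary masks.
keep = predictions["scores"] > 0.5
instance_masks = predictions["masks"][keep, 0] > 0.5   # (num_instances, H, W) booleans
print(f"{instance_masks.shape[0]} instances above the score threshold")
```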

Panoptic segmentation

Panoptic segmentation models both determine semantic classification of all pixels and differentiate each object instance in an image, combining the benefits of both semantic and instance segmentation.

In a panoptic segmentation task, each pixel must be annotated with both a semantic label and an “instance ID”. Pixels sharing the same label and ID belong to the same object; for pixels determined to be stuff, instance ID is ignored.
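
A toy example of this dual annotation, using hypothetical labels and a packing scheme loosely modeled on the COCO panoptic convention (an assumption, not a fixed standard), might look like this:

```python
import numpy as np

# Hypothetical class indices: "sky" and "road" are stuff, "car" is a thing.
SKY, ROAD, CAR = 0, 1, 2

# Per-pixel semantic labels for a toy 3x5 image.
semantic = np.array([
    [SKY,  SKY,  SKY,  SKY,  SKY],
    [CAR,  CAR,  ROAD, CAR,  CAR],
    [ROAD, ROAD, ROAD, ROAD, ROAD],
])

# Per-pixel instance IDs: two separate cars get IDs 1 and 2,
# while stuff pixels keep ID 0 because their instances are not counted.
instance = np.array([
    [0, 0, 0, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0],
])

# One way (an assumption here, loosely following the COCO panoptic convention)
# to pack both annotations into a single per-pixel ID.
panoptic_id = semantic * 1000 + instance
print(np.unique(panoptic_id))  # [0, 1000, 2001, 2002]
```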

Panoptic segmentation thus provides computer vision systems a comprehensive, holistic understanding of a given image. While its appeal is obvious, achieving panoptic segmentation in a consistent and computationally efficient way is an imposing challenge.

The challenge lies in unifying two contradictory methodologies: semantic segmentation models treat all pixels as stuff, disregarding individual instances of things; instance segmentation models isolate individual things, ignoring stuff. Neither type of model can adequately absorb the other’s responsibilities.

Initial attempts at panoptic segmentation models simply combined the two models, performing each task separately and then combining their output in a post-processing phase. This approach has two major drawbacks: it requires a great deal of computational overhead and struggles with discrepancies between data points output by the semantic segmentation network and data points output by the instance segmentation network.

Newer panoptic segmentation architectures aim to avoid these drawbacks with a more unified approach to deep learning. Most are built upon a “backbone” network, like a feature pyramid network (FPN), that extracts features from the input image, feeds that extracted data into parallel branches—like a “foreground branch” and “background branch”, or “semantic head” and “instance head”—and then merges the output of each branch using a weighted system. Proposed panoptic architectures include EfficientPS, OANet, PanopticFPN, UPSNet, SOGNet, BGRNet, AUNet, FPSNet and SpatialFlow.

Traditional image segmentation techniques

Traditional image segmentation techniques use information from a pixel’s color values (and related characteristics like brightness, contrast or intensity) for feature extraction, and can be quickly trained with simple machine learning algorithms for tasks like semantic classification.

While deep learning-based segmentation methods are capable of greater precision and more sophisticated image analysis—especially tasks like panoptic segmentation that require a great deal of contextual information—traditional methods are far less costly and computationally demanding, and can solve certain problems more efficiently.

Common traditional (or "classic") image segmentation techniques include:

  • Thresholding: Thresholding methods create binary images, classifying pixels based on whether their intensity is above or below a given “threshold value”. Otsu’s method is often used to determine the threshold value that minimizes intra-class variance.
  • Histograms: Histograms, which plot the frequency of certain pixel values in an image, are often used to define thresholds. For example, a histogram can be used to infer the values of background pixels, helping isolate object pixels.
  • Edge detection: Edge detection methods identify the boundaries of objects or classes by detecting discontinuities in brightness or contrast.
  • Watersheds: Watershed algorithms transform images into grayscale, then generate a topographical map in which each pixel’s “elevation” is determined by its brightness. Regions, boundaries and objects can be inferred from where “valleys”, “ridges” and “catchment basins” form.
  • Region-based segmentation: Starting with one or more “seed pixels”, region-growing algorithms group together neighboring pixels with similar characteristics. Algorithms can be agglomerative or divisive.
  • Clustering-based segmentation: An unsupervised learning method, clustering algorithms divide visual data into clusters of pixels with similar values. A common variant is K-means clustering, in which k is the number of clusters: pixel values are plotted as data points, and k random points are selected as the center of a cluster (“centroid”). Each pixel is assigned to a cluster based on the nearest—that is, most similar—centroid. Centroids are then relocated to the mean of each cluster and the process is repeated, relocating centroids with each iteration until the clusters have stabilized. A brief sketch of thresholding and clustering follows this list.
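
Where a quick, approximate segmentation is enough, these classic methods take only a few lines. Below is a minimal sketch of Otsu thresholding and K-means color clustering using OpenCV; the image path is a placeholder and k = 3 is an arbitrary choice.

```python
import cv2
import numpy as np

# "scene.jpg" is a placeholder path; any color image works.
image = cv2.imread("scene.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Thresholding with Otsu's method: the threshold value is chosen automatically
# to minimize the intra-class (within-class) variance of the two resulting groups.
otsu_value, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# K-means clustering on color values (k = 3 clusters, chosen arbitrarily).
pixels = image.reshape(-1, 3).astype(np.float32)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_, labels, centers = cv2.kmeans(pixels, 3, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# Rebuild a segmented image in which each pixel takes its cluster centroid's color.
segmented = centers[labels.flatten()].reshape(image.shape).astype(np.uint8)
```
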
Deep learning image segmentation models

Trained on an annotated dataset of images, the neural networks of deep learning image segmentation models discover underlying patterns in visual data and discern the salient features most relevant to classification, detection and segmentation.

Despite tradeoffs in computing requirements and training time, deep learning models consistently outperform traditional models and form the basis of most ongoing advancements in computer vision.

Prominent deep learning models used in image segmentation include:

  • Fully Convolutional Networks (FCNs): FCNs, often used for semantic segmentation, are a type of convolutional neural network (CNN) with no fully connected layers. An encoder network passes visual input data through convolutional layers to extract features relevant to segmentation or classification, and compresses (or downsamples) this feature data to remove non-essential information. This compressed data is then fed into decoder layers, which upsample the extracted feature data to reconstruct the input image with segmentation masks.
  • U-Nets: U-Nets modify the FCN architecture to reduce data loss during downsampling with skip connections, preserving greater detail by selectively bypassing some convolutional layers as information and gradients move through the neural network. The name comes from the U shape its layers form in architecture diagrams; a minimal sketch of the skip-connection idea follows this list.
  • DeepLab: Like U-Nets, DeepLab is a modified FCN architecture. In addition to skip connections, it uses dilated (or “atrous”) convolution to yield larger output maps without necessitating additional computational power.
  • Mask R-CNNs: Mask R-CNNs are a leading model for instance segmentation. Mask R-CNNs combine a region proposal network (RPN) that generates bounding boxes for each potential instance with an FCN-based “mask head” that generates segmentation masks within each confirmed bounding box.
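
To make the skip-connection idea concrete, here is a deliberately tiny U-Net-style encoder-decoder sketch in PyTorch—one downsampling step, one upsampling step and a single skip connection. It is an illustration of the pattern, not any published architecture, and all layer sizes are arbitrary.

```python
import torch
from torch import nn

class TinyUNet(nn.Module):
    """A minimal U-Net-style sketch: one downsampling step, one upsampling step,
    and a single skip connection. Not a production architecture."""

    def __init__(self, in_channels=3, num_classes=21):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        # After concatenating the skip connection, channels double (16 + 16).
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        skip = self.enc(x)                      # high-resolution features, kept for later
        x = self.bottleneck(self.down(skip))    # downsampled (coarser) features
        x = self.up(x)                          # upsample back to the skip's resolution
        x = torch.cat([x, skip], dim=1)         # skip connection preserves fine detail
        return self.head(self.dec(x))           # per-pixel class logits


logits = TinyUNet()(torch.randn(1, 3, 64, 64))  # shape: (1, num_classes, 64, 64)
```
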
Training datasets for deep learning models

To not only map boundaries of image segments, but also predict which stuff or thing each segment represents, deep learning models are trained on large, annotated datasets to recognize specific semantic classes. From these pre-labeled images, deep learning models infer the patterns and pixel values typical of each label.

Different tasks require different training data: a computer vision system for self-driving cars is trained on images labeled with semantic classes like “person”, “car”, “lane” and “stop sign”, while medical imaging models must be optimized to recognize specific classes of physical tissue, or tumors and other pathologies.

Training datasets must be carefully pre-labeled by human experts, which can be extremely laborious. Most image segmentation models make use of large, open-source datasets. These public datasets also serve as “ground truth” when evaluating the success of trained models: performance metrics are often expressed as the percentage of results that accurately match the annotations in the training dataset.
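
One widely used metric of this kind is intersection over union (IoU), which compares a predicted mask against the ground-truth mask class by class. A minimal sketch, using toy label maps:

```python
import numpy as np

def per_class_iou(prediction, ground_truth, num_classes):
    """Intersection over union for each class, given two per-pixel label maps."""
    ious = []
    for c in range(num_classes):
        pred_c, true_c = prediction == c, ground_truth == c
        intersection = np.logical_and(pred_c, true_c).sum()
        union = np.logical_or(pred_c, true_c).sum()
        ious.append(float(intersection / union) if union else float("nan"))
    return ious

# Toy 2x4 label maps; values are hypothetical class indices.
pred = np.array([[0, 0, 1, 1], [0, 1, 1, 2]])
true = np.array([[0, 0, 1, 1], [0, 0, 1, 2]])
print(per_class_iou(pred, true, num_classes=3))  # [0.75, 0.75, 1.0]
```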

Popular training datasets for deep learning image segmentation models include:

  • COCO (Common Objects in Context): a large-scale dataset containing over 330,000 images with annotated segments across 80 thing categories and 91 stuff categories (see the loading sketch after this list).
  • ADE20K: a scene segmentation dataset created by MIT containing over 20,000 images with over 150 semantic classes.
  • Cityscapes: a large-scale dataset focused on urban streets. Its data was captured in 50 cities at various times of day, times of year and weather conditions.
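
Working with such a dataset typically starts by loading its annotations and converting them into per-image masks. The sketch below uses the pycocotools library with a placeholder path to a downloaded COCO annotation file:

```python
from pycocotools.coco import COCO

# Placeholder path: point this at a downloaded COCO annotation file.
coco = COCO("annotations/instances_val2017.json")

# Pick the first image and fetch its instance annotations.
img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=img_id)
annotations = coco.loadAnns(ann_ids)

# Convert each polygon/RLE annotation into a binary segmentation mask.
masks = [coco.annToMask(ann) for ann in annotations]
print(len(masks), "instance masks for image", img_id)
```
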
Use cases for image segmentation

Image segmentation has become an essential tool in a variety of fields.

  • Medical imaging: Image segmentation has many applications across radiography, magnetic resonance imaging (MRI), ultrasound and computed tomography (CT), aiding tasks like tumor detection, brain segmentation, disease diagnosis and surgical planning.
  • Autonomous vehicles: Image segmentation allows self-driving cars to avoid obstacles like pedestrians and other cars, as well as identify lanes and traffic signs. It is similarly used to inform navigation in robotics.
  • Satellite imaging: Semantic and instance segmentation automate the identification of different terrain and topographical features.
  • Smart cities: Image segmentation powers tasks like real-time traffic monitoring and surveillance.
  • Manufacturing: In addition to powering robotics tasks, image segmentation powers product sorting and the detection of defects.
  • Agriculture: Image segmentation helps farmers estimate crop yields and detect weeds for removal.