Image segmentation is a computer vision technique that partitions a digital image into discrete groups of pixels—image segments—to inform object detection and related tasks. By parsing an image’s complex visual data into specifically shaped segments, image segmentation enables faster, more advanced image processing.
Image segmentation techniques range from simple, intuitive heuristic analysis to cutting-edge implementations of deep learning. Conventional image segmentation algorithms process low-level visual features of each pixel, like color or brightness, to identify object boundaries and background regions. Machine learning models, trained on annotated datasets, learn to accurately classify the specific types of objects and regions an image contains.
A highly versatile and practical computer vision method, image segmentation has a wide variety of artificial intelligence use cases, from aiding diagnosis in medical imaging to automating locomotion for robots and self-driving cars to identifying objects of interest in satellite images.
Image segmentation represents an advanced evolution of both image classification and object detection, and brings a distinct set of computer vision capabilities of its own.
Image classification applies a class label to an entire image. For example, a simple image classification model might be trained to categorize vehicle images as “car” or “truck”. Conventional image classification systems are limited in sophistication, as they do not process individual image features separately.
Object detection combines image classification with object localization, generating rectangular regions, called “bounding boxes”, in which objects are located: rather than merely labeling a vehicle image as “car” or “truck”, an object detection model could indicate where in the image the car(s) or truck(s) can be found. While object detection can classify multiple elements within an image and approximate each element’s width and height, it cannot discern precise boundaries or shapes. This limits the ability of conventional object detection models to delineate closely bunched objects with overlapping bounding boxes.
Image segmentation processes visual data at the pixel level, using various techniques to annotate individual pixels as belonging to a specific class or instance. “Classic” image segmentation techniques determine annotations by analyzing inherent qualities of each pixel (called “heuristics”) like color and intensity, while deep learning models employ complex neural networks for sophisticated pattern recognition. The outputs of this annotation are segmentation masks, representing the specific pixel-by-pixel boundary and shape of each class—typically corresponding to different objects, features or regions—in the image.
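To make this concrete, a segmentation mask can be thought of as an array with the same height and width as the image, holding a class label for every pixel. A minimal sketch in Python (the class IDs and layout here are purely illustrative):

```python
import numpy as np

# A toy 4x6 "image" annotated at the pixel level.
# Each value is a semantic class ID: 0 = background, 1 = road, 2 = car.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 0, 0],
    [1, 2, 2, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
])

# The segmentation mask for one class is simply the set of pixels
# carrying that class ID.
car_mask = (mask == 2)
print(car_mask.sum(), "pixels belong to the 'car' class")
```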
Broadly speaking, image segmentation is used for three types of tasks: semantic segmentation, instance segmentation and panoptic segmentation.
The difference between each type of image segmentation task lies in how they treat semantic classes: the specific categories a given pixel might be determined to belong to.
In the parlance of computer vision, there are two types of semantic classes. Each lends itself to different techniques for accurate, effective segmentation.
Things are classes of objects with characteristic shapes, like “car” or “tree” or “person”. Typically, things have clearly defined instances that are countable. They have relatively little variance in size from one instance to the next, as well as constituent parts distinct from the thing itself: for example, all cars have wheels, but a wheel is not a car.
Stuff refers to semantic classes that are amorphously shaped and highly variable in size, like “sky” or “water” or “grass”. Typically, stuff does not have clearly defined, countable individual instances. Unlike things, stuff does not have distinct parts: a blade of grass and field of grass are both equally “grass”.
Some classes, under certain image conditions, can be either things or stuff. For example, a large group of people can be interpreted as multiple “persons”—each a distinctly shaped, countable thing—or a singular, amorphously shaped “crowd”.
While most object detection efforts focus primarily on thing classes, it’s important to consider that stuff—sky, walls, floors, ground—makes up the majority of our visual context. Stuff is an essential data point for identifying things, and vice versa: a metal thing on a road is usually a car; the blue background behind a boat is probably water, while the blue background behind a plane is probably sky. This is of particular importance to deep learning models.
Semantic segmentation is the simplest type of image segmentation. A semantic segmentation model assigns a semantic class to every pixel, but doesn’t output any other context or information (like objects).
Semantic segmentation treats all pixels as stuff; it does not differentiate between stuff and things.
For example, a semantic segmentation model trained to identify certain classes on a city street would yield segmentation masks indicating the boundaries and contours for each relevant class of thing (like vehicles or light poles) and stuff (like roads and sidewalks), but not make any distinction between (or count the number of) multiple instances of the same class. Cars parked one in front of the other, for instance, might simply be treated as one long “car” segment.
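As a rough illustration, libraries such as torchvision ship pretrained semantic segmentation models. The sketch below assumes a recent torchvision release and a hypothetical local file named street.jpg; it assigns one class label to every pixel, with no notion of individual instances:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained semantic segmentation model (DeepLabV3 with a ResNet-50 backbone).
model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("street.jpg").convert("RGB")   # hypothetical input file
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    output = model(batch)["out"]          # shape: [1, num_classes, H, W]

# argmax over the class dimension gives one semantic label per pixel --
# every pixel gets a class, but instances are not distinguished.
semantic_map = output.argmax(dim=1).squeeze(0)
print(semantic_map.shape, semantic_map.unique())
```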
Instance segmentation inverts the priorities of semantic segmentation: whereas semantic segmentation algorithms predict only semantic classification of each pixel (with no regard for individual instances), instance segmentation delineates the exact shape of each separate object instance.
Instance segmentation isolates things from stuff—which it ignores—and can thus be understood as an evolved form of object detection that outputs a precise segmentation mask instead of an approximate bounding box.
It’s a more difficult task than semantic segmentation: even when things of the same class are touching or even overlapping one another, instance segmentation models must be able to separate and determine the shape of each one, whereas semantic segmentation models can simply lump them together. Consider, for example, how the two approaches would treat the parked cars in the city street example above: a semantic model outputs one continuous “car” region, while an instance model outputs a separate mask for each car.
Instance segmentation algorithms generally take either a two-stage or one-shot approach to the problem. Two-stage models, like Region-based Convolutional Neural Networks (R-CNNs), perform conventional object detection to generate bounding boxes for each proposed instance, then perform more refined segmentation and classification within each bounding box. One-shot models, like YOLO (You Only Look Once), achieve real-time instance segmentation by performing object detection, classification and segmentation simultaneously.
One-shot approaches offer greater speed (with a tradeoff in accuracy), whereas two-stage approaches offer greater accuracy (with a tradeoff in speed).
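For illustration, torchvision also provides a pretrained Mask R-CNN, a two-stage model of the kind described above. This sketch (again assuming a hypothetical street.jpg) returns a separate mask, label and confidence score for each detected instance:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Mask R-CNN is a two-stage, R-CNN-family instance segmentation model:
# it proposes bounding boxes first, then predicts a mask within each box.
model = models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = Image.open("street.jpg").convert("RGB")   # hypothetical input file
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    predictions = model([tensor])[0]

# Each detected instance comes with its own label, score and pixel mask.
for label, score, mask in zip(predictions["labels"],
                              predictions["scores"],
                              predictions["masks"]):
    if score > 0.8:
        # mask has shape [1, H, W] with per-pixel probabilities.
        print(int(label), float(score), (mask[0] > 0.5).sum().item(), "pixels")
```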
Panoptic segmentation models both determine semantic classification of all pixels and differentiate each object instance in an image, combining the benefits of both semantic and instance segmentation.
In a panoptic segmentation task, each pixel must be annotated with both a semantic label and an “instance ID”. Pixels sharing the same label and ID belong to the same object; for pixels determined to be stuff, instance ID is ignored.
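One common way to represent such output is to pack each pixel’s semantic label and instance ID into a single integer. The following toy sketch (with arbitrary class IDs and an illustrative offset) shows the idea:

```python
import numpy as np

# In a panoptic annotation, every pixel carries a (semantic label, instance ID) pair.
# A common encoding packs both into one integer: label * OFFSET + instance_id.
OFFSET = 1000  # illustrative constant; real datasets choose their own

semantic = np.array([[3, 3, 0],
                     [3, 7, 7]])      # e.g. 3 = "car" (thing), 7 = "road" (stuff), 0 = void
instance = np.array([[1, 2, 0],
                     [1, 0, 0]])      # instance IDs; 0 (ignored) for stuff and void

panoptic = semantic * OFFSET + instance

# Pixels sharing the same packed value belong to the same object instance;
# for stuff classes the instance ID stays 0, so the whole region shares one value.
print(np.unique(panoptic))
```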
Panoptic segmentation thus provides computer vision systems with a comprehensive, holistic understanding of a given image. While its appeal is obvious, achieving panoptic segmentation in a consistent and computationally efficient way is an imposing challenge.
The challenge lies in unifying two contradictory methodologies: semantic segmentation models treat all pixels as stuff, disregarding individual instances of things; instance segmentation models isolate individual things, ignoring stuff. Neither type of model can adequately absorb the other’s responsibilities.
Initial attempts at panoptic segmentation simply paired the two kinds of models, performing each task separately and then merging their outputs in a post-processing phase. This approach has two major drawbacks: it requires a great deal of computational overhead, and it struggles with discrepancies between the outputs of the semantic segmentation network and those of the instance segmentation network.
Newer panoptic segmentation architectures aim to avoid these drawbacks with a more unified approach to deep learning. Most are built upon a “backbone” network, like a feature pyramid network (FPN), that extracts features from the input image, feeds that extracted data into parallel branches—like a “foreground branch” and “background branch”, or “semantic head” and “instance head”—and then merges the output of each branch using a weighted system. Proposed panoptic architectures include EfficientPS, OANet, PanopticFPN, UPSNet, SOGNet, BGRNet, AUNet, FPSNet and SpatialFlow.
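The sketch below is a deliberately tiny, illustrative version of that shared-backbone pattern in PyTorch, not a reproduction of any of the architectures listed above: a single feature extractor feeds a semantic head and an instance head, whose outputs would later be merged into one panoptic prediction.

```python
import torch
from torch import nn

class TinyPanopticNet(nn.Module):
    # Layer sizes here are illustrative, not taken from any published architecture.
    def __init__(self, num_classes=10, num_instances=20):
        super().__init__()
        self.backbone = nn.Sequential(              # stand-in for an FPN-style backbone
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.semantic_head = nn.Conv2d(64, num_classes, 1)    # per-pixel class logits
        self.instance_head = nn.Conv2d(64, num_instances, 1)  # per-pixel instance logits

    def forward(self, x):
        features = self.backbone(x)                 # shared features feed both branches
        return self.semantic_head(features), self.instance_head(features)

sem_logits, inst_logits = TinyPanopticNet()(torch.randn(1, 3, 128, 128))
print(sem_logits.shape, inst_logits.shape)
```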
Traditional image segmentation techniques use information from a pixel’s color values (and related characteristics like brightness, contrast or intensity) for feature extraction, and can be quickly trained with simple machine learning algorithms for tasks like semantic classification.
While deep learning-based segmentation methods are capable of greater precision and more sophisticated image analysis—especially tasks like panoptic segmentation that require a great deal of contextual information—traditional methods are far less costly and computationally demanding, and can solve certain problems more efficiently.
Common traditional (or "classic") image segmentation techniques include:
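As one example of this classic, heuristic approach, the following sketch uses OpenCV’s Otsu thresholding and connected-component labeling to segment a grayscale image purely on pixel intensity, with no learned model at all (the file name is hypothetical):

```python
import cv2

image = cv2.imread("cells.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file

# Otsu's method picks the intensity threshold that best splits the histogram
# into foreground and background.
_, binary_mask = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Connected-component labeling then turns the binary mask into distinct segments.
num_segments, labels = cv2.connectedComponents(binary_mask)
print(f"Found {num_segments - 1} segments (plus background)")
```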
Trained on an annotated dataset of images, the neural networks of deep learning image segmentation models discover underlying patterns in visual data and discern the salient features most relevant to classification, detection and segmentation.
Despite tradeoffs in computing requirements and training time, deep learning models consistently outperform traditional models and form the basis of most ongoing advancements in computer vision.
Prominent deep learning models used in image segmentation include:
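Many of these models follow an encoder-decoder pattern with skip connections, popularized by U-Net. The toy PyTorch module below sketches that structure with illustrative layer sizes rather than any published configuration: downsample to capture context, upsample to recover resolution, and pass encoder features across a skip connection so fine boundaries survive.

```python
import torch
from torch import nn

class MiniUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, num_classes, 1))

    def forward(self, x):
        skip = self.enc(x)                      # high-resolution features
        x = self.mid(self.down(skip))           # low-resolution context
        x = self.up(x)                          # back to input resolution
        x = torch.cat([x, skip], dim=1)         # skip connection
        return self.dec(x)                      # per-pixel class logits

print(MiniUNet()(torch.randn(1, 3, 64, 64)).shape)   # -> [1, 2, 64, 64]
```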
To not only map boundaries of image segments, but also predict which stuff or thing each segment represents, deep learning models are trained on large, annotated datasets to recognize specific semantic classes. From these pre-labeled images, deep learning models infer the patterns and pixel values typical of each label.
Different tasks require different training data: a computer vision system for self-driving cars is trained on images labeled with semantic classes like “person”, “car”, “lane” and “stop sign”, while medical imaging models must be optimized to recognize specific classes of tissue, tumors and other pathologies.
Training datasets must be carefully pre-labeled by human experts, which can be extremely laborious. Most image segmentation models therefore make use of large, open-source datasets. These public datasets also serve as “ground truth” when evaluating the success of trained models: performance metrics are often expressed as the percentage of results that accurately match the ground-truth annotations.
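For instance, one widely used per-mask metric is intersection over union (IoU): the pixels shared by the predicted and ground-truth masks, divided by the pixels covered by either. A toy calculation:

```python
import numpy as np

# The arrays here are toy examples of a ground-truth mask and a predicted mask.
ground_truth = np.array([[0, 1, 1],
                         [0, 1, 1],
                         [0, 0, 0]], dtype=bool)
prediction   = np.array([[0, 1, 1],
                         [0, 1, 0],
                         [0, 1, 0]], dtype=bool)

intersection = np.logical_and(ground_truth, prediction).sum()
union = np.logical_or(ground_truth, prediction).sum()
print(f"IoU = {intersection / union:.2f}")   # 3 / 5 = 0.60
```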
Popular training datasets for deep learning image segmentation models include:
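As an example of working with such a dataset, torchvision can download the PASCAL VOC segmentation split and expose each sample as an (image, mask) pair; the root path and download flag below are illustrative:

```python
from torchvision import datasets

# Public benchmarks ship images together with their ground-truth masks.
voc = datasets.VOCSegmentation(root="data/", year="2012",
                               image_set="train", download=True)

image, mask = voc[0]          # both are PIL images; the mask holds class IDs
print(image.size, mask.size)
```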
Image segmentation has become an essential tool in a variety of fields.