Panoptic Segmentation: Everything You Need to Know

Ango AI
9 min read · Aug 16, 2022

Image segmentation is one of the most widespread data labeling tasks, finding uses in hundreds of different ML applications. Panoptic segmentation is one type of image segmentation and, while it is among the most time-intensive, it is arguably also one of the most powerful. In this article, we’ll dive deep into what panoptic segmentation is and how you can use it.

If you need a broader overview of image annotation, check out our complete guide to image annotation. For the medical field, you may check our guide to medical image annotation.

If you need to segment your data and are looking for a data annotation platform to do it, Ango Hub has all you need to start. If instead, you are looking to outsource segmenting your images, Ango Service is what you are looking for. But let’s get back to panoptic segmentation.

What is Image Segmentation?

Image segmentation is the process of labeling an image such that its various parts are classified down to the pixel level, making segmentation one of the most information-intensive ways of labeling an image.

Segmented images, along with their segmentation data, can be used to train extremely powerful ML and deep learning models that provide detailed information about what an image contains and where.

Image segmentation both classifies and localizes objects of interest within an image, making it the labeling task of choice when we need to train highly detailed detectors and have the data and resources to do so.

Before we delve into the various forms of image segmentation, we need to understand two key concepts. Any segmented image can contain two kinds of elements:

Things (Instance): Any countable object is referred to as a thing. As long as one can identify and separate the class into objects comprising it, it is a thing.

To exemplify — person, cat, car, key, and ball are called things.

Stuff (Semantic): An uncountable, amorphous region of similar texture is known as stuff. Stuff generally forms an indivisible region within an image.

For instance, roads, water, sky, etc. would belong to the “stuff” category.

Types of Image Segmentation

With these two concepts in hand, we can look at the types of image segmentation. There are three main categories:

Semantic segmentation refers to the task of exhaustively identifying different classes of objects in an image. All pixels of an image belong to a specific class (we automatically consider some unlabeled pixels as belonging to the background class).

Fundamentally, this treats everything in the image as stuff: each pixel gets a class, but individual objects of the same class are not told apart.

Instance segmentation refers to the task of identifying and localizing the individual instances of each semantic category. Each object gets its own identifier, even when it belongs to the same category as another object, so instance segmentation can be seen as an extension of semantic segmentation.

Instance segmentation thus identifies things in an image.

Panoptic Segmentation combines the merits of both approaches: it semantically distinguishes different kinds of objects and also identifies the separate instances of each kind in the input image. It gives a global view of image segmentation.

Essentially, the panoptic segmentation of an image contains data related to both the overarching classes and the instances of these classes for each pixel, thus identifying both stuff and things within an image.
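
To make the difference concrete, here is a minimal sketch on a made-up 2×4 image. The class names, the grid, and the helper names are all illustrative assumptions; they do not come from any dataset or library.

```python
# Toy 2x4 "image": each pixel holds (class_name, instance_id).
# instance_id is None for stuff and 1, 2, ... for things.
# Every name and value here is made up for illustration.
THINGS = {"person"}

panoptic = [
    [("sky", None), ("sky", None), ("sky", None), ("sky", None)],
    [("person", 1), ("road", None), ("person", 2), ("road", None)],
]


def semantic_view(grid):
    """Keep only the class of each pixel, so instances of a class are merged."""
    return [[cls for cls, _ in row] for row in grid]


def instance_view(grid):
    """Keep only thing pixels as (class, instance_id); stuff pixels become None."""
    return [
        [(cls, inst) if cls in THINGS else None for cls, inst in row]
        for row in grid
    ]


semantic = semantic_view(panoptic)
instances = instance_view(panoptic)

print(semantic[1])   # semantic: both people share the 'person' class
print(instances[1])  # instance: people are told apart, road pixels are ignored
print(panoptic[1])   # panoptic: every pixel has a class, things also have an ID
```

The panoptic grid is the only one of the three from which the other two can be derived, which is why it is the most expensive to annotate.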

Image Classification, Instance Segmentation, Semantic Segmentation, and Panoptic Segmentation on Ango Hub

The Panoptic Segmentation Format

So how exactly do we keep both the semantic and the instance information for the same image? Kirillov et al., at Facebook AI Research and Heidelberg University, solved this problem in a very intuitive manner. Panoptic segmentation has the following properties.

Two Labels per Pixel: Panoptic segmentation assigns two labels to each pixel of an image: a semantic label and an instance ID. Pixels with the same semantic label belong to the same class, and the instance IDs differentiate the instances of that class.

Annotation File Per Image: Because every pixel receives a label, the annotation is usually saved as a separate image file (by convention, a PNG) holding one value per pixel, rather than as a set of polygons or an RLE encoding.

Non-Overlapping: Unlike instance segmentation, each pixel in panoptic segmentation has a unique label corresponding to the instance which means there are no overlapping instances.
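
The non-overlapping property can be illustrated with a small sketch. The masks, class IDs, and the helper name below are made-up assumptions for illustration: a list of per-object masks (as in instance segmentation) can claim the same pixel twice, while a single grid with one (semantic label, instance ID) pair per pixel cannot.

```python
# Instance segmentation output: one binary mask per object. Masks may overlap.
# The two 1x5 masks below are made-up examples.
instance_masks = {
    "person_1": [[1, 1, 1, 0, 0]],
    "person_2": [[0, 0, 1, 1, 0]],
}


def count_overlapping_pixels(masks):
    """Count pixels that are claimed by more than one mask."""
    grids = list(masks.values())
    height, width = len(grids[0]), len(grids[0][0])
    return sum(
        1
        for y in range(height)
        for x in range(width)
        if sum(grid[y][x] for grid in grids) > 1
    )


# Panoptic output: a single grid with the same dimensions as the image and
# exactly one (semantic label, instance ID) pair per pixel. The class ids
# (0 and 21) are hypothetical; instance ID 0 is used here to mean "stuff".
panoptic_grid = [[(0, 1), (0, 1), (0, 2), (0, 2), (21, 0)]]

overlaps = count_overlapping_pixels(instance_masks)
print("overlapping pixels in the instance masks:", overlaps)
print("panoptic grid size:", len(panoptic_grid), "x", len(panoptic_grid[0]))
```

Because the panoptic grid has one slot per pixel, turning overlapping masks into a panoptic label always requires deciding which object wins each contested pixel.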

Consider the image above and its resultant panoptic segmentation PNG file. The panoptic segmentation is stored as a PNG, with the same dimensions as the input image. This means that masks are not stored as polygons or in RLE format but rather as pixel values in a file.

The image above was a 600×400 image, and the panoptic segmentation is likewise a 600×400 image. However, while the input image has pixel values in the range 0–255 (the grayscale range), the output panoptic segmentation image has a very different range of values. Each pixel value in the resulting panoptic segmentation file encodes the class of that pixel and, for things, the instance it belongs to.

How Annotations are Stored in the Panoptic Segmentation Format

Let’s dive into some Python to understand how exactly the labels are represented. The key question we want to address is:

For any pixel value in the panoptic segmentation output, what is its corresponding class?

First, let’s check which classes we have. There are 133 classes in total, representing various categories of objects.

Now let’s turn to the panoptic segmentation output. Its unique pixel values include numbers such as 1038, 2000, 3038, and 5000.

Each of these pixel values encodes both an instance ID and a class ID, which we interpret as follows.

The instance IDs separate different instances of the same class by a unique identifier. Note that instance IDs are global: they do not restart for each semantic class; rather, the instance ID is a running counter over all the instances in the image. In the case above, since the highest instance ID is 5, we have 5 thing instances in total; the rest is stuff.

Mathematically, we need to decode these pixel values to get the indices of the classes they represent. In this encoding, the pixel value modulo an offset (here 1000) gives the ID of the class: class ID = pixel value % offset.

Because of this operation, 2000 % 1000 = 5000 % 1000 = 0. Thus, pixel value 2000 belongs to the same class as pixel value 5000, i.e. they both belong to class 0. Similarly, values 1038 and 3038 both belong to class 38.

Correlating our class IDs with the model classes, we see that 38 is the tennis_racket class and 0 is the person class, and similarly for the other classes. This answers our initial question of which pixel values correspond to which class in the panoptic segmentation label.
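
The decoding can be sketched in a few lines of Python. The offset of 1000 is taken from the examples in the text; reading the instance ID as the integer quotient (pixel value // offset) is an assumption about this encoding, and the function name is hypothetical.

```python
OFFSET = 1000  # taken from the examples in the text (2000, 5000, 1038, 3038)


def decode(pixel_value, offset=OFFSET):
    """Split a panoptic pixel value into (class_id, instance_id).

    class_id    = pixel_value %  offset   (as described in the text)
    instance_id = pixel_value // offset   (assumption about the encoding)
    """
    instance_id, class_id = divmod(pixel_value, offset)
    return class_id, instance_id


for value in (2000, 5000, 1038, 3038):
    class_id, instance_id = decode(value)
    print(f"pixel value {value}: class {class_id}, instance {instance_id}")
```

Under this reading, 2000 and 5000 are two different instances of class 0, and 1038 and 3038 are two different instances of class 38.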
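
As a rough sketch of this lookup, the snippet below groups pixel values by class name. Only the two class names mentioned in the text are included; the dictionary, the fallback name, and the function name are assumptions, and a real label map would list every class.

```python
from collections import defaultdict

OFFSET = 1000  # offset used in the examples in the text

# Only these two entries come from the text; a full label map would list
# every class the model knows about.
CLASS_NAMES = {0: "person", 38: "tennis_racket"}


def group_by_class(pixel_values, offset=OFFSET):
    """Map each class name to the sorted pixel values that belong to it."""
    groups = defaultdict(list)
    for value in sorted(set(pixel_values)):
        class_id = value % offset
        # Hypothetical fallback for ids that are missing from CLASS_NAMES.
        name = CLASS_NAMES.get(class_id, f"unknown_{class_id}")
        groups[name].append(value)
    return dict(groups)


groups = group_by_class([2000, 5000, 1038, 3038])
for name, values in groups.items():
    print(name, values)
```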

Frameworks for Panoptic Segmentation

Panoptic FPN

Architecture of Panoptic FPN Combining Instance and Semantic Segmentation.

Introduced by the pioneers of Panoptic segmentation, this deep learning framework aims to unify the tasks of instance and semantic segmentation at the architectural level, designing a single network for both tasks.

They start from Mask R-CNN, originally designed for instance segmentation, and add a semantic segmentation branch to it. Both branches share a Feature Pyramid Network (FPN) backbone for feature extraction. The FPN extracts features at multiple scales, so that objects can be detected correctly regardless of the size at which they appear in the image.

Surprisingly, this simple baseline not only remains effective for instance segmentation but also yields a lightweight, well-performing method for semantic segmentation. Combining these two tasks the framework sets the foundation for Panoptic Segmentation architectures.
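
The article does not describe how the outputs of the two branches are combined into a single panoptic map. The sketch below shows one simple heuristic for doing so: paste instance masks in order of confidence, then fill the remaining pixels with stuff predictions. The function name, the data layout, the class IDs, and the offset of 1000 are all assumptions for illustration, not the authors' implementation.

```python
def merge_to_panoptic(instances, semantic_map, stuff_classes, offset=1000):
    """Merge instance-branch and semantic-branch outputs into one grid.

    instances:     list of dicts with 'class_id', 'score', 'mask' (grid of 0/1)
    semantic_map:  grid of class ids predicted by the semantic branch
    stuff_classes: set of class ids that count as stuff
    Returns a grid of instance_id * offset + class_id (instance_id 0 for
    stuff). Pixels claimed by neither branch stay at -1 (unlabeled).
    """
    height, width = len(semantic_map), len(semantic_map[0])
    panoptic = [[-1] * width for _ in range(height)]

    # 1. Paste thing instances, most confident first. A pixel that is already
    #    taken is skipped, so the more confident instance wins any overlap.
    ranked = sorted(instances, key=lambda inst: inst["score"], reverse=True)
    for instance_id, inst in enumerate(ranked, start=1):
        for y in range(height):
            for x in range(width):
                if inst["mask"][y][x] and panoptic[y][x] == -1:
                    panoptic[y][x] = instance_id * offset + inst["class_id"]

    # 2. Fill the remaining pixels with stuff predictions only.
    for y in range(height):
        for x in range(width):
            if panoptic[y][x] == -1 and semantic_map[y][x] in stuff_classes:
                panoptic[y][x] = semantic_map[y][x]
    return panoptic


# Made-up 1x4 example: two overlapping 'person' masks (class 0) and one
# pixel of a hypothetical stuff class with id 40.
instances = [
    {"class_id": 0, "score": 0.9, "mask": [[1, 1, 0, 0]]},
    {"class_id": 0, "score": 0.6, "mask": [[0, 1, 1, 0]]},
]
semantic_map = [[0, 0, 0, 40]]
result = merge_to_panoptic(instances, semantic_map, stuff_classes={40})
print(result)
```

The contested pixel goes to the more confident instance, which is what keeps the final map non-overlapping.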


Mask2Former

Mask2Former Architecture

Presented in 2022, this framework tackles instance and semantic segmentation with a single architecture, thereby also addressing panoptic segmentation, and advances the state of the art for panoptic segmentation on various datasets.

The framework is called “Masked-attention Mask Transformer (Mask2Former),” and can address any image segmentation task (panoptic, instance, or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions.
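
The idea of masked attention can be sketched for a single query in plain Python. This is a simplified illustration, not the Mask2Former implementation: it uses standard scaled dot-product attention, sets the scores of positions outside the predicted mask to negative infinity before the softmax, and (as an assumption) falls back to ordinary attention when the mask is empty.

```python
import math


def masked_attention(query, keys, values, mask):
    """Attention for one query, restricted to positions where mask is 1.

    query: list of floats; keys, values: lists of float lists; mask: 0/1 list.
    Returns (output_vector, attention_weights).
    """
    if not any(mask):
        # Assumption: with an empty mask, attend everywhere instead of nowhere.
        mask = [1] * len(keys)

    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if keep else -math.inf
        for key, keep in zip(keys, mask)
    ]

    # Numerically stable softmax; masked positions get exp(-inf) == 0.0.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Three image positions; the predicted mask covers positions 0 and 2 only.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
mask = [1, 0, 1]

output, weights = masked_attention(query, keys, values, mask)
print(weights)
print(output)
```

Position 1 lies outside the mask, so it receives zero attention weight and does not contribute to the output.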

This framework also uses two main branches: a pixel decoder branch and a transformer decoder branch. The pixel decoder performs a task fairly similar to that of the FPN discussed above, i.e. it upsamples the extracted features to several scales. The transformer decoder attends to these multi-scale features, and its output is combined with the pixel decoder’s output to predict the mask and class of each object.

Panoptic Segmentation Datasets

COCO Panoptic

Annotations from the COCO panoptic dataset

The panoptic task uses all the annotated COCO images and includes the 80 thing categories from the detection task and 53 stuff categories, a subset of the 91 from the stuff task, for 133 categories in total. This dataset is great for general object detection, and you’ll often see it used in the panoptic literature to train and fine-tune networks.
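
COCO panoptic annotations use a different encoding from the offset scheme discussed earlier: each segment ID is stored across the three color channels of the PNG, and a companion JSON entry maps the segment ID to its category. The sketch below shows that color-to-ID conversion as commonly described (ID = R + 256·G + 256²·B); the helper names and the sample segment entries are made up for illustration, so check the dataset's own documentation before relying on it.

```python
def rgb_to_id(rgb):
    """Convert an (R, G, B) pixel into a single integer segment id."""
    r, g, b = rgb
    return r + 256 * g + 256 * 256 * b


def id_to_rgb(segment_id):
    """Inverse of rgb_to_id: split a segment id back into (R, G, B)."""
    r = segment_id % 256
    g = (segment_id // 256) % 256
    b = segment_id // (256 * 256)
    return r, g, b


# Made-up segment records in the spirit of a per-image segment list:
# each one ties a segment id to a category id.
segments_info = [
    {"id": rgb_to_id((10, 0, 0)), "category_id": 1},
    {"id": rgb_to_id((0, 2, 0)), "category_id": 187},
]
category_of = {seg["id"]: seg["category_id"] for seg in segments_info}

pixel = (0, 2, 0)  # a pixel read from the annotation PNG
segment_id = rgb_to_id(pixel)
print("segment id:", segment_id, "category id:", category_of[segment_id])
```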


ADE20K

Some Annotations from the ADE20K Dataset

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. There are a total of 150 semantic categories, including “stuff” like sky, road, grass, and discrete objects like person, car, and bed.


Mapillary Vistas

Some Annotations from the Mapillary Dataset

The Mapillary Dataset is a set of 25,000 high-resolution images. The images are annotated with 124 semantic object categories and 100 instance categories. The dataset contains images from all over the globe, covering 6 continents. The data is ideal for panoptic segmentation tasks in the autonomous vehicle industry.


Cityscapes

Annotations from the Cityscapes dataset

A dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high-quality pixel-level annotations of 5,000 frames, in addition to a larger set of 20,000 weakly annotated frames.

It contains polygonal annotations that combine semantic and instance segmentation across 30 unique classes.

Panoptic Segmentation in Short

Panoptic Segmentation is a highly effective method of segmenting images, combining both semantic and instance segmentation within a single task. Although panoptic segmentation is a recent development, the research is fast-paced, and it is pushing the boundaries of object detection further.

Panoptic segmentation is extremely detail-rich thanks to its pixel-level class labels and can be used to train powerful deep learning frameworks such as the ones discussed above. However, labeling data down to the pixel level is a grueling process.

At Ango AI we deliver high-quality densely annotated images. Whether you’re looking to deploy a panoptic detector for an autonomous vehicle, a medical imagery task, or other problem, we ensure that our experts label each image carefully up to pixel perfection, using Ango Hub, our state of the art labeling platform natively supporting panoptic labeling.

Book a demo with us to learn how we can help you solve your data labeling needs.

Author: Balaj Saleem
Technical Editor: Onur Aydın

Originally published on August 16, 2022.