Image Annotation: An Introduction

When people talk about annotating data, there’s one topic that seems to always be the elephant in the room: image annotation.

The domain of image annotation is as vast and old as data science itself. Indeed, one of the first works ever done in the field of AI was to interpret and annotate line drawings. In recent times, however, the focus has evolved substantially. This evolution has been part and parcel of the advent of Big Data and various real-world application areas for computer vision, such as self-driving cars, facial recognition, augmented reality, and surveillance.

Teaching a computer how to see is no easy task. A machine learning model needs to train on images that have already been annotated correctly, so that it can later recognize similar content on its own and produce meaningful, accurate predictions. Image annotation, then, imposes an extra burden: the AI team needs to find or produce thousands, if not tens of thousands, of correctly annotated images to train the model, all before the model can be of any use.

Models can provide various kinds of outputs: for instance, predicting whether or not an object is present in an image, drawing a rectangular box around an object (commonly called a bounding box), or even creating a mask that covers the object itself with pixel-level accuracy. Each of these outputs requires annotated training data of the corresponding kind, so that the model can learn to produce that output accurately on its own.

We will explore different ways one can prepare this training data. We will do so by going over the most common types of image annotation.

Types of Labeling Tasks

We will delve deep into image annotation. Before that though, let’s go through the various tasks that trained image processing systems can perform.


Image Classification

This type of task usually checks whether a given property exists in an image, or whether a condition is satisfied. In other words, this means classifying an image within a set of predetermined categories based on the contents of the image. Usually, classification is posed as an answer to a question. Such a question may be, for example: “Does the image contain a bird?”

Object Detection

This takes classification one step further by capturing not only the presence but also the position of the object. The model finds the instance(s) of the object within an image and reports their coordinates. Building on the previous classification question, detection asks, for example: “Where is the bird in the image?”


Image Segmentation

Put simply, when doing segmentation, the machine learning model breaks the image down into smaller components.

There are two main ways a model can segment an image. In the first, the model assigns a label to a specific “entity” such as a person, a car, or a boat, which has delineated boundaries and is countable. In the second, it labels “areas,” which are not countable and may not have rigid boundaries, such as sky, water, land, or groups of people.

What is commonly called Instance Segmentation is the task of identifying the “entities,” along with every pixel that belongs to them, so that each segment captures the shape of its entity. Here, each instance is labeled separately, even when several instances belong to the same category.

On the other hand, Semantic Segmentation requires each pixel of the image to be labeled, such that it includes not only the “entities” but also the “areas.” Most importantly, it does not differentiate between different occurrences of the same object.
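The difference between the two kinds of segmentation can be sketched with two tiny masks. This is only an illustration: the class ids (SKY, ROAD, CAR), the 3x5 grid, and the helper names are assumptions made up for the example, and real masks are usually stored as image arrays rather than nested lists.

```python
# Hypothetical class ids for this sketch (not part of any standard).
SKY, ROAD, CAR = 0, 1, 2

# Semantic mask: every pixel gets a class id.
# The two cars share the same id, so they cannot be told apart.
semantic_mask = [
    [SKY, SKY, SKY,  SKY, SKY],
    [CAR, CAR, ROAD, CAR, CAR],
    [CAR, CAR, ROAD, CAR, CAR],
]

# Instance mask: every countable "entity" gets its own id.
# 0 means the pixel belongs to no instance (here: sky and road, the "areas").
instance_mask = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 2, 2],
    [1, 1, 0, 2, 2],
]


def pixels_per_class(mask):
    """Count how many pixels carry each class id in a semantic mask."""
    counts = {}
    for row in mask:
        for value in row:
            counts[value] = counts.get(value, 0) + 1
    return counts


def count_instances(mask):
    """Count distinct non-zero instance ids in an instance mask."""
    return len({value for row in mask for value in row if value != 0})
```

With these masks, pixels_per_class(semantic_mask) reports 8 car pixels but says nothing about how many cars there are, while count_instances(instance_mask) returns 2.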

Fig. 1. Left, Semantic Segmentation. Right, Instance Segmentation. Source.

Types of Image Annotation

Bounding Boxes

As of right now, this is by far the most common approach to image labeling, as it is the one that most often fulfills the requirements of models processing images. A bounding box is a rectangular area containing an object of interest. The box defines the location of that object and puts an upper bound on its size.

Each bounding box is a set of coordinates that delineates where the object starts and ends, both horizontally and vertically. Under the hood, there are two main ways to format such annotations. The first uses two (x, y) points to represent two opposite corners of the rectangle, typically the top-left and the bottom-right. These two points allow us to extrapolate the other two corners. The second format uses a single (x, y) point to represent one corner of the box, typically the top-left, while another tuple (w, h) represents the width and the height of the bounding box.
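The two formats carry the same information, so converting between them is simple arithmetic. Below is a minimal sketch, assuming the common image convention where the origin is the top-left corner of the image, y grows downward, and the anchor point of the width/height format is the top-left corner of the box. The class and function names are hypothetical, chosen only for this example.

```python
from typing import NamedTuple


class CornerBox(NamedTuple):
    """Box stored as two opposite corners: top-left and bottom-right."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


class XYWHBox(NamedTuple):
    """Box stored as its top-left corner plus width and height."""
    x: float
    y: float
    w: float
    h: float


def corners_to_xywh(box: CornerBox) -> XYWHBox:
    """Convert the two-corner format to the corner-plus-size format."""
    return XYWHBox(box.x_min, box.y_min,
                   box.x_max - box.x_min, box.y_max - box.y_min)


def xywh_to_corners(box: XYWHBox) -> CornerBox:
    """Convert the corner-plus-size format back to two corners."""
    return CornerBox(box.x, box.y, box.x + box.w, box.y + box.h)


def all_four_corners(box: CornerBox):
    """Extrapolate the two missing corners from the two stored ones."""
    return [
        (box.x_min, box.y_min),  # top-left
        (box.x_max, box.y_min),  # top-right
        (box.x_max, box.y_max),  # bottom-right
        (box.x_min, box.y_max),  # bottom-left
    ]
```

For example, corners_to_xywh(CornerBox(10, 20, 110, 70)) gives XYWHBox(x=10, y=20, w=100, h=50), and xywh_to_corners reverses the conversion.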

When do you want to use bounding boxes?
When the primary purpose of your model or system is to detect or localize an object of interest. Uses of object detection range from activity recognition and face detection to face recognition and video object co-segmentation, among other similar tasks.

Polygonal Segmentation

The drawback of bounding boxes is that they cannot fully delineate the shape of the object, only its general position. Polygonal segmentation addresses this problem. The approach relies on placing a series of points around the object and connecting them to form a polygon that encloses it. Although polygons drawn by human annotators are not pixel-perfect, they provide adequate data regarding the shape, size, and location of the object.

The polygons are stored in various formats, for example as a list of points corresponding to the vertices of the polygon. Commonly, this is represented either as a list of lists, with one [x, y] pair per vertex, or as a single flat list of consecutive coordinates (x1, y1, x2, y2, ...).
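As a rough sketch of how polygon annotations can be handled, the helpers below convert between a flat coordinate list and a list of [x, y] pairs, and derive the size and location of the object from its vertices. The function names are hypothetical, and the area calculation uses the standard shoelace formula, which assumes a simple (non-self-intersecting) polygon.

```python
def flat_to_pairs(flat):
    """Turn [x1, y1, x2, y2, ...] into [[x1, y1], [x2, y2], ...]."""
    if len(flat) % 2 != 0:
        raise ValueError("a flat polygon needs an even number of coordinates")
    return [[flat[i], flat[i + 1]] for i in range(0, len(flat), 2)]


def pairs_to_flat(pairs):
    """Turn [[x1, y1], [x2, y2], ...] into [x1, y1, x2, y2, ...]."""
    return [coord for point in pairs for coord in point]


def polygon_bounding_box(pairs):
    """Smallest axis-aligned box around the polygon, as (x, y, w, h)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def polygon_area(pairs):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(pairs)
    for i in range(n):
        x1, y1 = pairs[i]
        x2, y2 = pairs[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2
```

For a triangle with vertices [[0, 0], [4, 0], [0, 3]], polygon_area returns 6.0 and polygon_bounding_box returns (0, 0, 4, 3). The second result shows that a bounding box can always be recovered from a polygon, but not the other way around.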

When do you want to use polygonal segmentation?
When the system being built needs to capture not only the position of an object of interest but also its shape and size. This makes polygonal segmentation the natural choice for most segmentation tasks.


How can Ango AI help?

Ango AI provides an end-to-end, fully managed data labeling service for AI teams, including image annotation. With our ever-growing team of labelers and our in-house labeling platform, we provide efficient and convenient labeling for your raw data.

Our labeling software allows our annotators to label images with both bounding boxes and polygons in a fast and efficient way. After labeling, our platform also allows reviewers to verify that our labelers’ work is satisfactory and meets or exceeds our high quality requirements.

Once done, we export the annotations in various formats such as COCO or YOLO, among others, depending on the project.
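To give a sense of how these export formats differ, here is a minimal sketch of converting one bounding box between them. It is not Ango AI's exporter. It assumes the commonly documented conventions: a COCO box is [x_min, y_min, width, height] in pixels, while a YOLO label line is "class_id x_center y_center width height" with all four values normalized by the image size. The function names are hypothetical.

```python
def coco_bbox_to_yolo(bbox, image_width, image_height):
    """Convert a COCO-style pixel box to YOLO-style normalized values.

    bbox is assumed to be [x_min, y_min, width, height] in pixels.
    Returns (x_center, y_center, width, height), each divided by the
    corresponding image dimension.
    """
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / image_width,
        (y_min + h / 2) / image_height,
        w / image_width,
        h / image_height,
    )


def yolo_label_line(class_id, bbox, image_width, image_height):
    """Format one object as a line of a YOLO-style label text file."""
    xc, yc, w, h = coco_bbox_to_yolo(bbox, image_width, image_height)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```

For a 400x200 image, the COCO box [100, 50, 200, 100] becomes (0.5, 0.5, 0.5, 0.5): a box centered in the image that spans half of its width and half of its height.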

To bring labeling speed to the next level, these tools will soon be supplemented by smart annotation techniques using AI assistance, drastically reducing the time of such tasks, from minutes to a matter of seconds.

Author: Balaj Saleem
Editor: Lorenzo Gravina
Technical Proofreading: Onur Aydın

Originally published on June 21, 2021.



