Active Learning: Less Data, Better Models

When it comes to the world of AI, the word “learning” has a very specific meaning: it is the ability of a system to improve by extracting patterns from data. In the constantly evolving domain of Machine Learning, there are many learning approaches that cater to different use cases. Two approaches, however, are most commonly employed:

  • Supervised Learning: The model trains on labeled data, forming a hypothesis (often a very complex function) that allows it to make predictions on unlabeled data. To this domain belong the tasks of image classification, prediction, and forecasting.
  • Unsupervised Learning: The model trains on unlabeled data, delineating hidden structures and patterns that lie within it. To this domain belong the tasks of dimensionality reduction and clustering.

There are, however, many other, less explored types of learning, such as reinforcement learning and semi-supervised learning. One such type is Active Learning, an approach that is rarely at the forefront of learning strategies, but one that can be of immense use to many machine learning projects and tasks.

Fundamentally, Active Learning is an approach that aims to use the least amount of data to achieve the highest possible model performance. When following an Active Learning approach, the model chooses the data that it will learn the most from, and then trains on it.

While traditional (passive) supervised machine learning trains the model in a single iteration on all training data, Active Learning evolves over several iterations as follows:

  1. Choose the initial training data (a small subset of all data).
  2. Train your model on this data.
  3. Find the unlabeled data points the model is most uncertain about.
  4. Label this data using an oracle (a human or machine that can provide accurate labels).
  5. Repeat steps 2–4 until all data is exhausted, acceptable model performance is achieved, or time / budget constraints are reached.
The Active Learning loop.
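The loop above can be sketched in a few lines of code. The following is a minimal pool-based illustration using scikit-learn; the synthetic dataset, the least-confidence score, the seed size of 10 and the budget of 5 rounds are all illustrative choices, not prescriptions from the article. The "oracle" here simply reveals the true labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 1: a small initial labeled subset (5 per class); the rest is the pool.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(5):                            # budget: 5 query rounds
    # Step 2: train on the currently labeled data.
    model.fit(X[labeled], y[labeled])
    # Step 3: score the pool; low top-class probability = high uncertainty.
    probs = model.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)
    queries = np.argsort(uncertainty)[-10:]   # 10 most uncertain points
    # Step 4: "ask the oracle" (here, just reveal the true labels).
    for q in sorted(queries, reverse=True):   # pop from the back first
        labeled.append(pool.pop(q))
# Step 5: stop — the query budget is exhausted.
print(f"labeled {len(labeled)} of {len(X)} samples")
```

After five rounds the model has seen only 60 of the 500 samples, yet each of those samples was chosen because the model was least certain about it.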

Types of Active Learning

Pool Based Active Learning

This is the most popular approach, commonly used when working on Active Learning projects.

The idea is that, given a large pool of unlabeled data, the model is initially trained on a labeled subset of it. These training samples are then removed from the pool, and the remaining pool is repeatedly queried for the most informative data. Each time data is fetched and labeled, it is removed from the pool and the model trains on it. Slowly, the pool is exhausted as the model queries data, coming to understand the data's distribution and structure better. This approach, however, is highly memory-consuming, since the entire pool must be available at once.

Stream Based Active Learning

This approach relies on moving through the dataset sample by sample. Each time a new sample is presented to the model, it is determined whether that sample should be queried for its label. However, since not all of the data is available at once, performance over time is often not on par with the pool-based approach: the samples that get queried may not be the optimal ones, i.e. those providing the most information for our active learner.
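A stream-based learner can be sketched by iterating over the data once and querying only when the current model's confidence drops below a threshold. Everything below is an illustrative assumption: the synthetic dataset, the 0.8 confidence threshold, and retraining from scratch after every query (a real system might update incrementally instead).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Seed the model with a few labeled samples from each class.
seed = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
seen_X = [X[i] for i in seed]
seen_y = [y[i] for i in seed]
model = LogisticRegression(max_iter=1000)
model.fit(seen_X, seen_y)

queried = 0
stream = [i for i in range(len(X)) if i not in seed]   # samples arrive one by one
for i in stream:
    confidence = model.predict_proba(X[i:i + 1])[0].max()
    if confidence < 0.8:            # uncertain: query the oracle for a label
        seen_X.append(X[i])
        seen_y.append(y[i])
        model.fit(seen_X, seen_y)   # retrain on the grown labeled set
        queried += 1
print(f"queried {queried} of {len(stream)} streamed samples")
```

Note that the decision to query is made per sample as it arrives, without seeing the rest of the stream, which is exactly why the queried set can be less informative than one chosen from a full pool.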

Querying Strategies

The key to a successful Active Learning model lies in selecting the most informative / useful samples of data for the model to train on. This process of “choosing” the data that would help a system learn the most is known as querying. The performance of an Active Learning model depends heavily on its querying strategy.

There are many approaches to finding the most informative samples in the data. In practice these vary from case to case, but a few can be adapted to many use cases:

Uncertainty Sampling

Used for many classification tasks, and also known as the 1 vs 2 uncertainty comparison, this approach compares the probabilities of the two most likely outcomes / classes for a given data point. The data points where the difference between these two probabilities is small are usually the most confusing ones for the model, and hence would prove useful to query.

This Active Learning strategy is effective for selecting unlabeled items near the decision boundary. These items are the most likely to be wrongly predicted, and therefore, the most likely to get a label that moves the decision boundary.

Another measure that can be used for uncertainty sampling is entropy, a measure of “surprise” in a data instance. Points with high entropy are likely to be the most surprising / confusing to the model; knowing the labels for these points would therefore be beneficial.
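Both scores are easy to compute from a model's class-probability outputs. A small NumPy sketch, with two made-up probability rows chosen to illustrate the contrast:

```python
import numpy as np

# Each row: a model's class probabilities for one unlabeled data point.
probs = np.array([
    [0.50, 0.45, 0.05],   # top two classes nearly tied -> very uncertain
    [0.90, 0.07, 0.03],   # confidently predicted -> not worth querying
])

# "1 vs 2" margin: gap between the two most likely classes.
top2 = np.sort(probs, axis=1)[:, -2:]
margin = top2[:, 1] - top2[:, 0]      # small margin = confusing point

# Entropy: high when probability mass is spread across classes.
entropy = -np.sum(probs * np.log(probs), axis=1)

print(margin)    # margin ≈ [0.05, 0.83]
print(entropy)   # first point has the higher entropy
```

The first point has a tiny margin and high entropy, so both strategies would select it for labeling; the second point is left alone.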

A theoretical comparison of Active Learning vs supervised learning model performance. (source)

Query by Committee

Query by Committee is a querying approach to selectively sample in which disagreement amongst an ensemble of models is used to select data for labeling.

In other words, an array (committee) of models, which may differ in their implementations, is set up for the same task. As they train, they start to comprehend the structure of the data. There are, however, points where the models in this committee are in high disagreement (i.e. the classes / values assigned to a data point by different models are starkly different); these data points are chosen to be labeled by an oracle (usually a human), as they would provide the most information for the models.
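One common way to quantify this disagreement is vote entropy: treat each committee member's prediction as a vote and select the point whose votes are most evenly split. The committee votes below are hypothetical, chosen only to illustrate the computation.

```python
import numpy as np

# Hypothetical hard votes (class 0 or 1) from a committee of 5 models
# on 4 unlabeled data points.
votes = np.array([
    [0, 0, 0, 0, 0],   # unanimous -> zero disagreement
    [0, 1, 0, 1, 1],   # split 2 vs 3 -> high disagreement
    [1, 1, 1, 1, 0],   # split 4 vs 1 -> mild disagreement
    [0, 0, 0, 0, 1],   # split 4 vs 1 -> mild disagreement
])

def vote_entropy(row, n_classes=2):
    """Entropy of the committee's vote distribution for one point."""
    fractions = np.bincount(row, minlength=n_classes) / len(row)
    fractions = fractions[fractions > 0]      # log(0) is undefined
    return -np.sum(fractions * np.log(fractions))

scores = np.array([vote_entropy(r) for r in votes])
query = int(np.argmax(scores))   # the most contested point
print(query)                     # -> 1, the 2-vs-3 split
```

The unanimous point scores zero and the most contested point is sent to the oracle, mirroring the verbal description above.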

Diversity Sampling

As the name suggests, this querying strategy is effective for selecting unlabeled items from different parts of the problem space. If the diversity lies away from the decision boundary, however, these items are unlikely to be wrongly predicted, so they will have little effect on the model when a human gives them a label matching the model's prediction. Diversity Sampling is therefore often combined with Uncertainty Sampling, yielding a fair mix of queries that the model is uncertain about and that belong to different regions of the problem space.

Top right: one possible result from uncertainty sampling. If all the uncertainty is in one part of the problem space, however, giving these items labels will not have a broad effect on the model.
Bottom left: one possible result of diversity sampling.
Bottom right: one possible result from combining uncertainty sampling and diversity sampling. Adapted from Human-in-the-loop Machine Learning by Robert Monarch.
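One simple way to combine the two ideas is to cluster the unlabeled pool and then pick the most uncertain point within each cluster, so that queries cover different regions of the space. The sketch below assumes k-means clustering and uses random numbers as a stand-in for real model-uncertainty scores:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A synthetic pool with 4 natural regions in the problem space.
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Stand-in for per-point model uncertainty (in practice: margin, entropy, ...).
rng = np.random.default_rng(0)
uncertainty = rng.random(len(X))

# Diversity: partition the pool into regions with k-means.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Uncertainty within diversity: one query per region, the most uncertain point.
queries = []
for c in range(4):
    members = np.where(clusters == c)[0]
    queries.append(int(members[np.argmax(uncertainty[members])]))
print(queries)   # four queries, one from each region of the space
```

Each round thus labels points that are both informative individually and spread across the problem space, avoiding the "all uncertainty in one corner" failure mode described above.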

Active Learning and Data Annotation

As can be observed from the fundamentals of the Active Learning approach, this method reduces the total amount of data needed for a model to perform well. This means that the time and cost incurred by the data labeling process are greatly reduced, as only a fraction of the dataset is labeled.

However, the tasks of data annotation and model training are often handled separately, and by different organizations. The interaction between the two processes is thus a challenge that often becomes hard to tackle, owing to the confidentiality and privacy of the data and processes involved.

Often, Active Learning is used together with online or iterative learning during the process of data annotation, in Human-in-the-Loop approaches. Active Learning is then responsible for fetching the most useful data, while iterative learning enhances model performance as annotation continues, allowing a machine agent to assist humans.

A practical example of this would be using Active Learning for video annotation. In this task, consecutive frames are highly correlated, and each second contains a high number of frames (24–30 on average). Because of this, labeling every frame would be very time- and cost-intensive. It is thus more appropriate to select the frames where the model is most uncertain and label only those, allowing for better performance with a much lower number of annotated frames.

An intersection of Active Learning and Iterative Learning (Source)

Whether you are a data scientist working on projects that involve labeling vast amounts of data, or an organization dealing with a constant inflow of data that needs to be integrated into an AI system, labeling the right subset of that data before feeding it to the model will inevitably cater to many of your needs, drastically reducing the time and cost needed to attain a well-performing model.

More than 9 out of 10 researchers who have attempted work involving Active Learning claim that their expectations were fully or partially met (source).

At Ango AI we work with Active Learning and many other such techniques to ensure that the speed and quality of our labels are kept as high as possible, employing the latest research in AI assistance. Our focus on improving labeling efficiency via AI assistance has led us to pursue the intersection of Iterative Learning and Active Learning and their applications for quality data annotation.

Author: Balaj Saleem
Editor: Lorenzo Gravina
Technical Proofreading: Onur Aydın

Originally published at on October 13, 2021.
