What is Medical Data Labeling?

Ango AI
Jun 23, 2022


Medicine and healthcare have long been at the forefront of scientific and technical innovation. According to a recent survey, global healthcare spending stands at nearly $10 trillion. A considerable fraction of this investment goes into integrating technology into the healthcare industry. One of the most promising avenues at the confluence of healthcare and technology is AI, and training AI models depends on medical data labeling.

Data flows into the healthcare industry from a myriad of sources, including medical imaging devices, diagnosis documents, visual observations, and health data collection applications. This data can exist in visual (image-like) or textual form and can serve clinical, research, or administrative purposes. One characteristic of this raw data, however, is that it lacks structure and labels.

Why Label Medical Data

Raw medical data means little in the world of AI, at least in its present state. 9 out of 10 commercial initiatives follow supervised learning approaches, which need well-labeled and structured data. With the advent of deep learning, this reliance on data, and more importantly on large quantities of quality data, has become critical.

To deal with this lack of structure and labels, qualified medical professionals need to label the data on a data labeling platform. Models, then, will be able to use this data for training.

Having these ground truth labels for medical data is thus essential for AI applications. Hospitals, universities, and private research institutes are investing time and effort to ingest this labeled medical data and have state-of-the-art models assist the healthcare industry at scale.

How Labeled Medical Data is Used

At the forefront of this AI adoption are researchers who use data to push the boundaries of AI innovation and to lay the foundation for commercial and practical use of the technology.

Since AI is still in its early stages of maturity for practical and clinical applications, medical researchers are critical in ensuring steady and well-directed progress of the discipline. The Stanford AIMI lab offers examples of what AI is able to achieve with labeled medical data.

Clinical Medical Data Labeling

Beyond research, medical professionals use AI in a limited but effective way for prevention, diagnosis, and treatment of conditions. Deep learning models trained on hundreds of thousands of medical images tend to perform comparably to medical professionals in certain specific scenarios. Although they may not completely replace these professionals at the current stage, they can certainly assist the process.

Most commercial and clinical applications are in the field of radiology. After a radiologist completes a scan, the image is sent to a machine learning or deep learning model, which in turn presents its predictions to the radiologist. The radiologist then uses this extra layer of information to make a more comprehensive diagnosis.

Anomaly detection, another key area of machine learning, is being actively integrated into applications and monitoring systems that collect patient data. The goal is to detect abnormal behavior or identify individuals who may be at risk.
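To make the idea of anomaly detection concrete, here is a minimal sketch that flags unusual readings in a stream of patient measurements using a simple z-score rule. The function name `find_anomalies`, the threshold of 2.0, and the heart rate values are all assumptions made up for illustration; production monitoring systems typically rely on more sophisticated models.

```python
from statistics import mean, stdev


def find_anomalies(readings, threshold=2.0):
    """Return the indices of readings whose z-score exceeds the threshold.

    The z-score measures how many sample standard deviations a reading
    lies from the mean of all readings.
    """
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        # All readings are identical, so nothing stands out.
        return []
    return [i for i, value in enumerate(readings)
            if abs(value - mu) / sigma > threshold]


# Hypothetical resting heart rate samples (beats per minute).
heart_rates = [72, 75, 71, 74, 73, 70, 76, 72, 140, 74]
for index in find_anomalies(heart_rates):
    print(f"reading {index} looks abnormal: {heart_rates[index]} bpm")
```

A rule this simple is easy to audit, which matters in a clinical setting, but it assumes that normal readings cluster around a single mean.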

Types of Medical Data and Labels

Due to the vastness of the medical domain, there are numerous ways we can collect and store data. The following three types, however, occur most frequently:

Volumes

Fundamentally, this data type stores multiple slices of medical image information in a single comprehensive volume. Modalities such as CT, MRI, and PET scans often store their data as volumes of multiple slices. These volumes can then be projected and labeled in a 3D view or from various directions (sagittal, axial, and coronal views). The slices in a volume are spatially correlated and thus carry more information for both diagnosis and model training.

One can label volumes in the following ways based on the use case:

  1. Classification (Indicating presence or absence of a certain attribute in a volume or slice)
  2. Bounding Boxes (Localizing Class and Region of Interest)
  3. Segmentation (Pixel level localization of a class)
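The idea of viewing one volume from several directions can be sketched with plain nested lists. The example below assumes a volume indexed as `volume[z][y][x]` and maps fixing `z`, `y`, and `x` to the axial, coronal, and sagittal views respectively. That axis convention and all function names are assumptions for illustration; real scanners and file formats define their own orientation.

```python
def make_volume(depth, height, width):
    """Build a synthetic volume whose voxel value encodes its (z, y, x) position."""
    return [[[z * 100 + y * 10 + x for x in range(width)]
             for y in range(height)]
            for z in range(depth)]


def axial_slice(volume, z):
    """Fix z and return a (height x width) image."""
    return [row[:] for row in volume[z]]


def coronal_slice(volume, y):
    """Fix y and return a (depth x width) image."""
    return [plane[y][:] for plane in volume]


def sagittal_slice(volume, x):
    """Fix x and return a (depth x height) image."""
    return [[row[x] for row in plane] for plane in volume]


volume = make_volume(depth=3, height=4, width=5)
print("axial shape:   ", len(axial_slice(volume, 1)), "x", len(axial_slice(volume, 1)[0]))
print("coronal shape: ", len(coronal_slice(volume, 2)), "x", len(coronal_slice(volume, 2)[0]))
print("sagittal shape:", len(sagittal_slice(volume, 3)), "x", len(sagittal_slice(volume, 3)[0]))
```

Because every view is cut from the same underlying voxels, a label drawn in one view can in principle be located in the other two, which is what makes volumes richer than independent 2D images.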

Images

As the name implies, these are 2D images that come in RGB or black and white formats. They include X-rays, retinal fundus imaging, and microscopy. These modalities often come without relative spatial information.

One can label images in the following ways based on the use case:

  1. Classification
  2. Bounding Boxes
  3. Segmentation
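The three label types differ mainly in how much spatial detail they carry. As a rough sketch, the code below starts from a segmentation mask (the most detailed label) and derives the coarser bounding box and classification labels from it. The `BoundingBox` class, the function names, and the "nodule" class are hypothetical and chosen only for illustration; they do not reflect any particular labeling platform's export format.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    label: str
    x_min: int
    y_min: int
    x_max: int  # inclusive
    y_max: int  # inclusive


def mask_to_bounding_box(mask, label):
    """Return the tightest box around all nonzero pixels, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return BoundingBox(label, min(cols), min(rows), max(cols), max(rows))


def mask_to_classification(mask):
    """Return True if the class is present anywhere in the image."""
    return any(any(row) for row in mask)


# A toy 4 x 5 segmentation mask: 1 marks pixels belonging to the class.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_bounding_box(mask, "nodule")
print(box)
print("class present:", mask_to_classification(mask))
```

The conversion only works in this direction: a classification label cannot be turned back into a box, nor a box into a mask, which is why the choice of label type depends on the use case.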

Documents / Text

While the previous two types of medical data mainly help in solving computer vision problems in the medical domain, textual data is required for the domain of Natural Language Processing (NLP). Medical institutes produce numerous documents and textual artifacts that range from structured signals to unstructured form entries. In order to make sense of this textual data, we can label it using the following methods:

  1. Classification (of a data sample)
  2. Named Entity Recognition (Identification and localization of elements of interest within a text)
  3. Bounding box (Localization and classification of certain passages or sections within a document)
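Named Entity Recognition labels are commonly stored as character spans over the original text. The sketch below shows one way to represent them, assuming start-inclusive and end-exclusive offsets. The `EntitySpan` class, the helper names, the label set, and the sample note are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def annotate(text, phrase, label):
    """Label the first occurrence of a phrase in the text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return EntitySpan(start, start + len(phrase), label)


def spans_overlap(a, b):
    """Return True if two spans share at least one character."""
    return a.start < b.end and b.start < a.end


# A made-up clinical note with three hypothetical entity types.
note = "Patient reports chest pain; prescribed aspirin 81 mg daily."
spans = [
    annotate(note, "chest pain", "SYMPTOM"),
    annotate(note, "aspirin", "MEDICATION"),
    annotate(note, "81 mg", "DOSAGE"),
]
for span in spans:
    print(span.label, repr(note[span.start:span.end]))
```

Storing offsets rather than copies of the text keeps labels tied to exact positions, so two labelers' spans can be compared directly with a check like `spans_overlap`.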

Figure: Document labeling using Ango Hub

The Challenges of Medical Data Labeling

In the medical domain, the data collected is extremely personal and thus subject to strong privacy regulations. One of the key factors when using a streamlined cloud platform for data labeling, or outsourcing the whole labeling process, is therefore ensuring that data is handled in line with strong privacy and security requirements.

The way we address this at Ango AI is by baking a medical anonymizer service directly into the platform. Whenever data is uploaded, it goes through an anonymization layer, ensuring that all patient- and institute-specific details are removed before a labeler sees the data.
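As a rough illustration of what such an anonymization layer does conceptually, the sketch below drops identifying fields from a metadata record before it reaches a labeler. This is not the actual implementation of the anonymizer service. The field names follow common DICOM attribute keywords, but the set chosen here, the `anonymize` function, and the sample record are assumptions, and real de-identification must also handle cases such as text burned into the pixels.

```python
# Assumed list of identifying fields; a real service would cover many more.
IDENTIFYING_FIELDS = frozenset({
    "PatientName",
    "PatientID",
    "PatientBirthDate",
    "InstitutionName",
    "ReferringPhysicianName",
})


def anonymize(metadata, identifying_fields=IDENTIFYING_FIELDS):
    """Return a copy of the metadata without identifying fields.

    The input dictionary is left unchanged.
    """
    return {key: value for key, value in metadata.items()
            if key not in identifying_fields}


# A made-up metadata record for a single scan.
record = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "InstitutionName": "Example Hospital",
    "Modality": "CT",
    "SliceThickness": 1.25,
}
print(anonymize(record))
```

Filtering at upload time, before any labeler can open the data, means downstream steps never need to be trusted with the identifying details.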

Another key challenge of medical data labeling is the domain expertise required to label data. Since medical data is fairly complex, an untrained labeler often struggles to annotate it correctly. This is where the experience and qualifications of radiologists and radiographers come into play. However, such annotators are not only much harder to find but, due to their level of expertise, also cost considerably more per hour than general annotators.

At Ango we run a rigorous recruitment process to select capable and experienced medical professionals from the fields of radiology and pathology to deliver the most accurate labels possible.

Unlike traditional image formats, medical imaging comes in formats that are much more robust and suited to the needs of medical systems and professionals. The most popular among them are:

This, however, makes these formats comparatively more complex, and compatibility of data across different platforms is often an issue.

At Ango we ensure that the most popular formats are well supported, and they can often be imported directly into Ango Hub. For the formats we do not yet support, converting is easy.


Medical data labeling is a key factor in producing quality models for AI initiatives in the medical industry. The process of employing AI in healthcare is highly impactful both for research and clinical use.

At Ango AI we provide the necessary tools and a fully managed service to meet all your medical data labeling needs. Through the process we abstract out all the unnecessary details and deal with the challenges that medical labeling entails, ensuring a streamlined experience on our platform and the highest quality labels for your project.

Originally published at https://ango.ai on June 23, 2022.