Synthetic Medical Data: All You Need to Know

Often, the hardest part of training an ML model is obtaining the right data. This is even more true in fields where the data is especially sensitive and private, such as healthcare. This is why data science teams are now turning to synthetic medical data, which lets them create the data they need on demand.

In this article, we introduce synthetic medical data and look at some of its most promising projects, applications, and future possibilities.

If you already have medical data and are looking to label it, check out our introduction to medical image labeling, or our guide to medical file formats. Then, try out Ango Hub, the industry-leading medical data annotation platform, completely for free. Or talk to us about how we can help solve your labeling needs, including a fully-managed medical labeling workforce at your disposal.

The Need for Synthetic Medical Data

Data is the main actor in any machine learning project. However, data specific to one’s needs is hard to obtain, and annotated data even more so. In the medical domain, data scarcity for specific use cases is considerably worse because of the numerous concerns that surround the production, transfer, and usage of medical data.

Medical datasets are thus rare in public repositories. The following are some of the many reasons that contribute to this.

The time and availability of medical professionals are expensive.
Creating medical data requires expensive and specialized apparatus.
Privacy concerns over the data and the confidentiality of patients.
Data format and transfer protocols are relatively obscure compared to other domains.

Given this predicament, generating synthetic data can greatly alleviate the scarcity and cost of medical data. Essentially, synthetic data is generated by artificial sources, and not necessarily collected from the real world. In machine learning, synthetic datasets can be used on their own or in conjunction with real-world data to train models.
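To make the idea of combining real and synthetic data concrete, here is a minimal sketch in plain Python. The function name mix_datasets, the synthetic_fraction parameter, and the sample names are all hypothetical and chosen only for illustration; the right mixing ratio depends on the task and has to be validated experimentally.

```python
import random

def mix_datasets(real, synthetic, synthetic_fraction, seed=0):
    """Return a shuffled training list in which roughly `synthetic_fraction`
    of the samples are synthetic. All real samples are kept."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of synthetic samples needed to reach the requested share
    n_synth = round(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    n_synth = min(n_synth, len(synthetic))  # cannot use more than we have
    chosen = rng.sample(synthetic, n_synth)
    # Tag each sample with its origin so it can be traced later
    mixed = [(x, "real") for x in real] + [(x, "synthetic") for x in chosen]
    rng.shuffle(mixed)
    return mixed

# Placeholder sample identifiers (assumed for this example)
real_scans = [f"real_{i}" for i in range(80)]
synthetic_scans = [f"synth_{i}" for i in range(200)]
train_set = mix_datasets(real_scans, synthetic_scans, synthetic_fraction=0.2)
print(len(train_set))
```

Keeping the origin tag on every sample makes it easy to compare model performance with and without the synthetic portion later on.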

Generative Adversarial Networks

While there are various ways to generate synthetic data, by far the most promising approach is to employ generative adversarial networks (GANs). Fundamentally, GANs offer a viable approach to generating high-quality synthetic images.

Note that, apart from generating synthetic data, GANs can be used for various other tasks such as domain adaptation, denoising, and modality transfer. For the scope of this article, however, we focus on synthetic data generation.

Some Tasks tackled by GANs in the Medical Domain

GANs have two important parts (separate models):

Generator: Given an input vector, it generates an output (here, an image). Put simply, the goal of the generator is to produce images that resemble the real training images as closely as possible.
Discriminator: Given a sample, which may be real or generated, it determines whether that sample is real or not. The discriminator penalizes the generator for producing implausible results.

Over time, as the generator and discriminator compete with each other in training, each becomes better at its own task: generating and discerning synthetic images, respectively. Consequently, the generated images can become realistic enough to be nearly indistinguishable from real data.
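The interplay between the two models can be sketched through their loss functions. The snippet below is a minimal illustration of the standard binary cross-entropy GAN losses (with the commonly used non-saturating generator loss), written in plain Python. It is not taken from any of the papers discussed here, and the function names are illustrative. The inputs are the discriminator's outputs, interpreted as the probability that a sample is real.

```python
import math

def binary_cross_entropy(prediction, target, eps=1e-12):
    """BCE for one probability `prediction` against a 0/1 `target`."""
    p = min(max(prediction, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and generated samples near 0."""
    real_term = sum(binary_cross_entropy(p, 1) for p in d_real) / len(d_real)
    fake_term = sum(binary_cross_entropy(p, 0) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating form: low when generated samples are scored as real."""
    return sum(binary_cross_entropy(p, 1) for p in d_fake) / len(d_fake)

# A discriminator that easily spots fakes: low D loss, high G loss
print(discriminator_loss([0.95, 0.90], [0.05, 0.10]), generator_loss([0.05, 0.10]))
# A discriminator that can no longer tell the difference (outputs 0.5 everywhere)
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]), generator_loss([0.5, 0.5]))
```

When the discriminator outputs 0.5 for every sample, its loss is 2·ln 2 and the generator's loss is ln 2. This corresponds to the point at which generated images can no longer be told apart from real ones.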

These generated images, which closely resemble real data, then serve as synthetic data for an array of machine learning applications.

GANs to Create Synthetic Medical Data

Synthetic Retina Fundus Images

Description: A medical scan of the retina of the eye, with blood vessel delineation. Used to detect various eye and vision problems.

Modality: Retina Fundus Imaging.

Repositories:

Papers:

In the example above, the input to the generator is simply the mask of the retinal blood vessels. Given this mask, the generator creates realistic retina images.

Synthetic Skin Lesions

GAN Generated Skin Lesion Samples (source)

Description: Abnormalities on the surface of the skin. Images of these are used by dermatologists, often in the early detection of skin cancer.

Modality: Pictures / Photographs (RGB)

Repositories:

Papers:

Specifically, in the case above, given a mask representing the region of the skin lesion, the GAN generates realistic images showing how the lesion would appear on an actual patient.
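For readers unfamiliar with masks, the sketch below shows what such a conditioning input looks like: a binary image in which 1 marks the lesion region and 0 marks everything else. The circular shape and the function name circular_mask are assumptions made purely for illustration. In practice, masks come from expert annotation or a segmentation model and have irregular shapes.

```python
def circular_mask(height, width, center_row, center_col, radius):
    """Binary mask: 1 inside the circle (stand-in lesion region), 0 elsewhere."""
    return [
        [1 if (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2 else 0
         for c in range(width)]
        for r in range(height)
    ]

# Toy 8x8 mask with a small "lesion" centered at row 4, column 3
mask = circular_mask(height=8, width=8, center_row=4, center_col=3, radius=2)
for row in mask:
    print("".join("#" if v else "." for v in row))
```

A mask-conditioned generator takes an array like this as input and is trained to output a full-color image whose lesion occupies the marked region.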

Synthetic Mammograms

Artificially Generated Mammograms (source)

Description: An X-Ray scan of the breast, for various diagnostic purposes such as breast cancer detection.

Modality: X-Ray

Paper: High-Resolution Mammogram Synthesis

The image above shows artificially generated mammograms. The authors progressively train GANs to generate mammograms at increasing resolutions, starting at 16×16 images and going up to 1280×1024.

Synthetic Chest X-Rays

Description: Plain X-ray images. Synthetic generation has primarily focused on chest X-rays.

Modality: X-Ray

Paper: X-Ray Synthesis

The authors generate synthetic X-ray images using masks of the lungs. They further claim that the generated images are so realistic that, on a Turing test (classifying images as real or fake), a clinician attains only 66% accuracy.
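To put such a figure in context, the accuracy on this kind of real-versus-fake test can be computed as below. This is a minimal sketch with made-up labels, not the protocol or data from the paper. On a balanced set of real and generated images, pure guessing would give about 50% accuracy, so a score in the 60s means the reader is only modestly better than chance.

```python
def turing_test_accuracy(true_labels, reader_guesses):
    """Fraction of images the reader labeled correctly as 'real' or 'fake'."""
    if len(true_labels) != len(reader_guesses) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == g for t, g in zip(true_labels, reader_guesses))
    return correct / len(true_labels)

# Hypothetical reading session: 6 images, half real and half generated
truth   = ["real", "fake", "real", "fake", "real", "fake"]
guesses = ["real", "real", "real", "fake", "fake", "fake"]
accuracy = turing_test_accuracy(truth, guesses)
print(f"accuracy = {accuracy:.2f} (chance level on a balanced set is 0.50)")
```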

Synthetic Brain MRIs

Description: MR scans of the brain, primarily used to detect tumors and other abnormalities.

Modality: MRI

Paper: GAN-based synthetic brain MR

Using tumor masks, the authors train a GAN to output realistic brain MR images that contain tumors at the specified locations.

Some More Promising Examples

While the above examples show the usability of GANs in generating synthetic medical images, there is a lot more research that is advancing this frontier. The following papers are an interesting place to start further research.

Notes and Limitations of Synthetic Medical Data

The usefulness of GAN-generated images goes beyond machine learning. Many applications of these synthetic datasets have been in training physicians. This is especially helpful where privacy concerns limit the data available for training physicians.

There is active research on generative models, and GANs are not the only way to generate synthetic images. Variational autoencoders, flow-based models, and diffusion models, for example, may all help in generating synthetic medical data.

The benefit of GAN-generated data to model performance is not completely certain. According to this paper, synthetic medical data did not provide a large performance advantage for the authors’ use case: performance with and without synthetic data remained similar. However, there have been experiments, such as this one on synthetic chest X-rays, where the authors documented an increase in accuracy.

A fascinating side note is that GAN-generated images visually resemble original images so closely that physicians often fail the Turing test of telling machine-generated images from originals, as stated here and here.

One fundamental limitation is that GAN architectures are notoriously sensitive to hyperparameters, as stated here. This makes it extremely difficult to get useful synthetic data, and GANs often output sub-par images. Furthermore, image resolution remains an issue in GAN-generated images: apart from this paper, the majority of the images generated lie below the 1000×1000 threshold.

Conclusion

While synthetic medical data is an extremely promising avenue, a lot remains to be researched to make it more viable for mainstream machine learning projects. However, the pace at which advancements are being made with GANs is phenomenal, and in the next few decades we may see remarkable usage of synthetic medical data.

Until then, data collected and annotated by qualified individuals remains the most accepted form of training data. At Ango AI, we ensure that the training data needs of all our partners are met in the best way possible, providing tailor-made datasets for their models while maintaining the highest quality standards.

Author: Balaj Saleem
Technical Proofreader: Onur Aydın

Originally published at https://ango.ai on July 26, 2022.