As expectations of artificial intelligence and machine learning algorithms rise, data annotation has become a fundamental necessity for AI projects. Perhaps surprisingly, data labeling accounts for a significant share of the time and cost of AI development. According to a recent Cognilytica report [1] on data preparation and labeling, “Data preparation and engineering tasks represent over 80% of the time consumed in most AI and Machine Learning projects”. At the same time, this cost grows with the scale and complexity of the problem. Moreover, it is important to note that a large amount of annotated data is required not only for training AI models but also for validating AI solutions.
Once a team decides to start an AI project, the following question immediately arises: “Should we build the data annotation tool in-house and annotate our data internally, or buy a solution from a third-party vendor?”
There is no single answer to this question. It depends on several factors, such as the complexity of the annotation and the scale of the data. For this reason, we list the main costs of data annotation below and leave the answer to you.
What are the costs of Data Annotation?
Annotation Tool Development
Before you start annotating your data, you need an annotation environment. For small and straightforward tasks, you can use open-source tools such as Computer Vision Annotation Tool (CVAT) or Doccano. For more complex problems and large datasets, however, professional tools specialized for the problem must be developed. In addition, for large-scale datasets, the annotation tool must be able to distribute tasks to a large number of annotators. For example, in an autonomous driving project, a huge amount of data coming from different cameras and sensors has to be fused inside the annotation tool to reconstruct a 3D map of the environment.
Besides its core functionality, an annotation tool must be user-friendly enough to ease the annotators’ job, so that large-scale data annotation becomes feasible.
Recruiting Human Workforce
After developing the annotation tool, you need to hire your annotators (most commonly interns). Recruiting even one person is already a heavy burden for companies, and doing so at scale makes the problem worse.
According to a Cognilytica report [2], for every dollar spent on third-party data labeling, five dollars are spent on internal data labeling efforts, since the cost of an underutilized human workforce is much higher internally.
For the problem you are solving, it might be easy to find people. More importantly, though, after a while you have to keep your annotators motivated to do their tasks, and a payment or a raise alone will not solve the issue at that point.
Reaching Domain Experts
Many AI/ML problems require advanced domain expertise. For instance, tasks such as medical image diagnosis, banking, finance, law, and insurance applications, natural language processing in a specific language, and fashion recommendation need professionals from the related fields. Reaching out to these professionals and hiring them brings an extra cost to companies.
Preparing Comprehensive Annotation Instructions
During the annotation process, you need to make sure that all annotators are aligned with each other so that their annotations are consistent. For this reason, every detail must be explained comprehensively. However, a large dataset can contain a practically unlimited number of outlier cases, so preparing a comprehensive list of instructions may not be straightforward at all.
Distributing Tasks to Annotators
After developing the annotation tool and recruiting annotators, it is time to distribute the data to the annotators and start the annotation process. However, this might not be straightforward. For example, annotators differ in their abilities, and assigning the right task to the right annotator can become an important step in data annotation. In addition, sorting data instances by their label uncertainty and avoiding the labeling of near-identical instances are not widely known, but they are crucially important steps.
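To make the idea of sorting by label uncertainty more concrete, here is a minimal sketch. It assumes that a preliminary model already produces class probabilities for each unlabeled item; the item ids, the probabilities, and the function names are hypothetical and only serve as an illustration. Items whose predicted distribution has the highest entropy go to the front of the annotation queue.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def rank_by_uncertainty(predictions):
    """Return item ids ordered from most to least uncertain.

    predictions: dict mapping item id -> list of class probabilities
    (assumed to come from some preliminary model).
    """
    return sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )


# Hypothetical model outputs for three unlabeled images.
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # model is confident
    "img_002": [0.34, 0.33, 0.33],  # model is very unsure
    "img_003": [0.70, 0.20, 0.10],
}

queue = rank_by_uncertainty(predictions)
print(queue)
```

Entropy is only one possible uncertainty measure; other scores, such as the margin between the two most likely classes, would fit into the same sorting step.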
Evaluating the Performance of Annotators
Another important step of data annotation is evaluating the performance of the annotators. The goal is not to blame annotators for their mistakes, but to assign the right tasks to the right annotators according to their skills. In addition, human bias, which is a vital and still unsolved problem in AI, can find its way into the data during annotation. Addressing human bias in AI therefore starts at the data annotation level, and evaluating annotator performance plays a crucial role in that effort.
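One simple way to evaluate annotators is to mix a small set of items with known, expert-verified labels into their tasks and measure how often each annotator agrees with them. The sketch below assumes such a gold set exists; the annotator names, item ids, and labels are made up for illustration.

```python
from collections import defaultdict


def annotator_accuracy(annotations, gold):
    """Compute per-annotator accuracy on items that have a gold label.

    annotations: list of (annotator, item_id, label) tuples.
    gold: dict mapping item_id -> expert-verified label.
    Items without a gold label are ignored.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for annotator, item_id, label in annotations:
        if item_id not in gold:
            continue
        total[annotator] += 1
        if label == gold[item_id]:
            correct[annotator] += 1
    return {annotator: correct[annotator] / total[annotator] for annotator in total}


# Hypothetical gold labels and annotations.
gold = {"doc_1": "invoice", "doc_2": "contract", "doc_3": "invoice"}
annotations = [
    ("ann_a", "doc_1", "invoice"),
    ("ann_a", "doc_2", "contract"),
    ("ann_a", "doc_3", "invoice"),
    ("ann_b", "doc_1", "invoice"),
    ("ann_b", "doc_2", "invoice"),   # disagrees with the gold label
    ("ann_b", "doc_3", "invoice"),
    ("ann_b", "doc_9", "contract"),  # no gold label, ignored
]

scores = annotator_accuracy(annotations, gold)
print(scores)
```

Scores like these can then feed back into task assignment, for example by routing harder items to annotators with higher accuracy on similar gold items.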
Exploring Smarter Ways to Annotate Data
As AI and ML advance, the opportunities to explore novel and smarter ways to annotate data are also increasing. These opportunities improve the chances of reducing the duration and cost of annotating large-scale data.
Quality
The quality of training data is a crucial factor for highly accurate AI solutions. There are a couple of ways to ensure the quality of data annotation. First, a single data point (such as a single image, document, or video) can be annotated by several annotators, and the final label can be decided by a voting mechanism. Second, after the data is annotated, a separate set of annotators can verify the annotations manually. Although these approaches increase the scale of the annotation task, they are the most effective ways to improve data quality.
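The voting mechanism described above can be sketched in a few lines. This is a minimal example that assumes each item receives one categorical label per annotator; the function name and the choice to return None on a tie (so the item can be sent to a reviewer) are our own assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None if there is no clear winner.

    labels: list of labels given to one item by different annotators.
    A tie for first place (or an empty list) returns None so that the
    item can be escalated to a reviewer instead of being decided arbitrarily.
    """
    if not labels:
        return None
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


print(majority_vote(["cat", "cat", "dog"]))
print(majority_vote(["cat", "dog"]))
```

Using an odd number of annotators per item reduces the number of ties for binary labels, at the cost of more annotation work.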
In addition, there are smarter and faster ways to control the quality of data annotation. For example, identifying outlier cases by visualizing the data together with their annotations, applying anomaly detection techniques to find and fix mislabeled data points, and adding AI models as a third voter are among the most effective techniques for improving annotation quality.
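As a rough illustration of using an AI model as an additional voter, the sketch below flags items where a model confidently disagrees with the human label, so that only those items are sent back for review. It assumes a trained model that outputs a label and a confidence score per item; the records, the threshold of 0.9, and the function name are hypothetical.

```python
def flag_suspect_labels(records, threshold=0.9):
    """Return ids of items whose human label a model confidently contradicts.

    records: list of (item_id, human_label, model_label, model_confidence).
    threshold: minimum model confidence required to flag a disagreement
    (an assumed value that would need tuning on real data).
    """
    suspects = []
    for item_id, human_label, model_label, model_confidence in records:
        if model_label != human_label and model_confidence >= threshold:
            suspects.append(item_id)
    return suspects


# Hypothetical annotation records with model predictions.
records = [
    ("img_1", "car", "car", 0.99),         # agreement
    ("img_2", "car", "truck", 0.97),       # confident disagreement
    ("img_3", "bus", "truck", 0.55),       # disagreement, but model is unsure
    ("img_4", "pedestrian", "cyclist", 0.93),
]

to_review = flag_suspect_labels(records)
print(to_review)
```

A flagged item is not necessarily mislabeled, since the model can be wrong too. The point is to focus manual verification on the small subset where an error is most likely.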
What can Ango AI do for you?
Ango AI provides you with the highest-quality training data through a combination of proficient human labelers and a cutting-edge, AI-assisted annotation platform. Our tools, specialized for text, image, video, and document annotation, make our human labelers’ job easy and effective so that you can focus on AI and ML development without the hassle.
Author: Onur Aydın
REFERENCES:
[1] https://www.cognilytica.com/2020/01/31/data-preparation-labeling-for-ai-2020/
[2] https://www.cognilytica.com/2019/03/06/report-data-engineering-preparation-and-labeling-for-ai-2019/
[3] https://www.pexels.com/tr-tr/fotograf/insanlar-kadin-teknoloji-doktor-4226264/