The Biggest Challenges in Data Annotation
Data-related tasks consume nearly 80% of the time spent on AI projects, which makes them a key factor in the machine learning pipeline. Within these tasks, data labeling alone takes, on average, up to one fourth of a project’s time. Like the later stages of model development and hyperparameter tuning, data labeling comes with challenges of its own, and it becomes one of the most difficult, time-consuming, and expensive tasks if it is not handled properly.
In the industry, many organizations building an AI/ML pipeline tackle data labeling haphazardly and underestimate its complexity. This pitfall leads to inadequate results and may contribute to the fact that only 8% of firms engage in core practices that support widespread adoption. Most firms have run only ad hoc pilots or apply AI/ML in just a single business process, as reported by Harvard Business Review.
So what exactly makes data labeling such a challenging task? There are several facets to the answer, and this article breaks them down in detail.
Subject Matter Expertise
Subject matter expertise is the amount of domain knowledge a labeler has about the data being labeled. Fundamentally, data labeling relies on human knowledge to prepare data that a model will later train on.
Often the data cannot be labeled accurately without expertise in its characteristics and complexity. This is the primary reason subject matter experts are required for many labeling tasks.
For instance, a task that asks annotators to label tumors in MRI scans would be very difficult for someone with no medical or radiological knowledge to understand and carry out. This type of data is best understood by an expert radiologist or a doctor.
Consider another case in which an organization wants to distinguish faulty architectural blueprints from robust ones. A qualified architect would do the best job of identifying such blueprints, whereas an unqualified labeler would likely make many mistakes in this complex decision-making process.
The availability and inclusion of subject matter experts thus becomes the primary challenge of a data labeling task. Not only can these experts be expensive, but they are also often very hard to access, because the experts and the organization operate in different domains.
Subjectivity and Human Bias
Many machine learning tasks require data that is subjective. Sometimes there is no right or wrong answer, which makes the task inherently fuzzy and leaves it to the labeler’s judgment. This introduces human bias into the labels, as labelers follow what seems to them the best (or most logical) answer.
More technically, this is known as cognitive bias, which can manifest itself in various ways, including:
- Confirmation Bias: The labelers’ tendency to label data in a way that confirms their existing beliefs. For instance, given data related to COVID-19 vaccine effectiveness, a labeler may rely on preconceived notions about the vaccine’s effectiveness while labeling, confirming beliefs they already hold.
- Anchoring Bias: The labelers’ tendency to give more weight to the data they encountered early on, or to the first piece of information relayed to them. For example, the initial examples provided for a labeling task often strongly shape how labelers define each label, and they tend to keep following those patterns.
- Functional Fixedness: The labelers’ tendency to associate a label with only one function or use. For instance, when asked to label “an object to push down nails” in an image, they will most likely label a hammer and not a wrench, even though a wrench can fulfill the same function.
One example of subjectivity comes from a recent project handled by Ango AI, which aimed to discern which frames in a video were the most interesting. The videos were labeled so that they could then be summarized using only the most interesting frames. As one may observe, the importance or significance of a frame depends entirely on the labeler’s discretion. A closely related problem that this causes is low consistency, which is another challenge of data labeling.
Another use case is scene analysis from a still image. Two labelers might give starkly different labels to the same scene. For instance, in the image below, one observer may see the man holding the briefcase giving the cogwheel to the robot while the scientist interprets the results, whereas another may see the scientist programming the robot to give the cogwheel to the man holding the briefcase.
In the simplest terms, this is the phenomenon captured by the proverbial question of whether the glass is half full or half empty: the answer depends entirely on who is observing.
Consistency
Consistency, with regard to data labeling, is the level of agreement among the different individuals (or machines) that labeled a specific item (or row) of data. It applies when multiple labelers label the same piece of data. In general, high consistency is required for quality labeled data. However, maintaining consistency can be fairly challenging, partly because of the subjectivity and bias discussed above.
Beyond these reasons, it is inherently human to make mistakes in tasks that require judgment, discretion, or logic, and so different labels arise for the same data item. This lowers consistency and must be addressed before the data can be delivered.
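To make the idea of agreement concrete, here is a minimal sketch of two common ways it could be measured for a pair of labelers: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The article does not prescribe a particular metric, so the choice of kappa, the function names, and the sample labels below are all illustrative assumptions.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two labelers chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    observed = percent_agreement(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both labelers pick the same
    # class if each labeled at random with their own class frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single, identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two labelers for the same six scans.
labeler_1 = ["tumor", "tumor", "healthy", "healthy", "tumor", "healthy"]
labeler_2 = ["tumor", "healthy", "healthy", "healthy", "tumor", "tumor"]

print(f"percent agreement: {percent_agreement(labeler_1, labeler_2):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(labeler_1, labeler_2):.2f}")
```

A kappa near 1 indicates strong agreement, while a value near 0 means the labelers agree about as often as chance alone would predict. Raw percent agreement can look deceptively high when one label dominates the data, which is why a chance-corrected measure is often reported alongside it.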
There are multiple ways to enhance consistency, but some of the most effective ones are the following:
- Review system: Any platform used for labeling should have an integrated, robust, and effective review system that allows “reviewers” to check labels, so that those which are erroneous or significantly inconsistent can be relabeled. In general, this approach yields more consistent data.
- Communication: It is very important that the requirements for how the data is to be labeled are communicated clearly, succinctly, and effectively to the labelers. This is often done through traditional methods such as workshops, meetings, memos, or emails. However, an ideal labeling platform should integrate communication features for both the owners of the data and the labelers, so that they remain on the same page throughout the labeling process. This also improves labeling consistency, since with open communication labelers tend to follow clearly defined guidelines.
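As a rough illustration of how a review step might decide which items to send back for relabeling, the sketch below takes the labels that several labelers gave each item, accepts the majority label when enough of them agree, and queues the rest for a reviewer. This is not a description of any particular platform: the function name, the `min_agreement` threshold of 0.75, and the sample frame IDs are assumptions made for the example.

```python
from collections import Counter


def flag_for_review(item_labels, min_agreement=0.75):
    """Split items into accepted consensus labels and a review queue.

    item_labels maps an item ID to the list of labels it received from
    different labelers. An item is accepted when the share of labelers
    choosing the most common label reaches min_agreement.
    """
    consensus = {}
    needs_review = []
    for item_id, labels in item_labels.items():
        if not labels:
            needs_review.append(item_id)  # unlabeled items always need attention
            continue
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) >= min_agreement:
            consensus[item_id] = top_label
        else:
            needs_review.append(item_id)
    return consensus, needs_review


# Hypothetical labels for three video frames.
frames = {
    "frame_001": ["interesting", "interesting", "interesting"],
    "frame_002": ["interesting", "boring", "boring"],
    "frame_003": ["boring", "boring", "boring", "interesting"],
}

consensus, needs_review = flag_for_review(frames)
print("accepted:", consensus)
print("send to reviewer:", needs_review)
```

The threshold is a trade-off: a stricter value sends more items to reviewers and raises cost, while a looser one lets more disputed labels through. For highly subjective tasks, a disagreement may also signal that the labeling instructions need to be clarified rather than that any one labeler made a mistake.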
Data Privacy
With the growing adoption of outsourcing and crowdsourcing for data labeling, it is of utmost importance to ensure the safety, privacy, and confidentiality of the data being labeled. Unauthorized access, deletion, and storage of data at an unauthorized location are concerns that the labeling entity needs to address.
Organizations often choose to keep labeling on-premise to tackle this problem and ensure that no third party can access the data. This is the most effective way to ensure privacy; however, it comes with its own managerial and administrative overhead, as managing labeling on premises and putting quality assurance measures in place is an extensive process.
The ideal way to tackle this challenge is to ensure that the firm that labels the data complies closely with privacy regulations and processes the data lawfully, fairly, and in a transparent manner, keeping all stakeholders informed. This spares the organization the complex layer of workforce and project management and lets experts label the finalized data. Some of the things to look out for when ensuring data privacy are:
- Confidentiality of Data
- Processing data only in accordance with instructions
- Anonymization of personal / sensitive data
- Deletion / Return of data after the processing (labeling) period
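One of the points above, anonymization of personal or sensitive data, can be partly handled before the data ever leaves the organization. The sketch below is a simplified illustration under several assumptions: the record fields (`customer_id`, `name`, `text`), the secret key, and the email pattern are all hypothetical, and the regular expression is deliberately simple and will not catch every form of personal data. Note also that a keyed hash is pseudonymization rather than full anonymization, since whoever holds the key can re-link the records.

```python
import hashlib
import hmac
import re

# Deliberately simple pattern; real data usually needs broader PII detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value, secret_key):
    """Replace an identifier with a stable keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_for_labeling(record, secret_key):
    """Build the reduced record that is shared with labelers.

    Only the fields needed for labeling are copied; everything else
    (for example the customer's name) is dropped.
    """
    return {
        "id": pseudonymize(record["customer_id"], secret_key),
        "text": EMAIL_RE.sub("[EMAIL]", record["text"]),
    }


# Hypothetical source record and key; the key stays with the data owner.
record = {
    "customer_id": "C-1042",
    "name": "Jane Doe",
    "text": "Contact me at jane.doe@example.com about my order.",
}
shared = prepare_for_labeling(record, secret_key=b"demo-key")
print(shared)
```

Because the hash is stable for a given key, the data owner can match returned labels back to the original records without the labeling firm ever seeing the real identifier.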
Conclusion
Data labeling, especially at the large scale required today for many use cases, can be extremely challenging, with a variety of facets that need to be addressed. If these challenges are not addressed, the data may either be of low quality (the pitfalls of which were discussed in our “Quality Assurance Techniques in Data Annotation” article) or incur extra layers of complexity and financial overhead.
Often it is best to outsource this task to firms that you can trust, that deliver quality and speed, and that tackle all these challenges professionally. At Ango AI we provide such a service, ensuring that you get the highest quality, consistent and unbiased data labeled by a handpicked and highly talented team of experts, subject to multiple cycles of review. Throughout the process we ensure transparent and effective communication, providing initial samples of well-labeled data, instructions, and the ability for any labeler to report issues with the data or the labeling process.
Author: Balaj Saleem
Editor: Lorenzo Gravina
Technical Proofreading: Onur Aydın
Originally published at https://ango.ai on September 16, 2021.