Data labeling industry / When humans feed the machine

Challenges and opportunities of Artificial Intelligence for Good


Artificial Intelligence is the technology to fulfill such a vision. At the intersection of computer science and data science, AI’s first step is to create a computational representation of everything. Algorithms and Big Data are the two keys: algorithms define mathematically how things work and interact, and Big Data stores the descriptions of everything. Together they generate models, artificial replicas of the world that machines are able to process thanks to Machine Learning.

Artificial Narrow Intelligence (ANI), the first phase, addresses this mission by tackling fundamental and applied research problems one by one. By solving very specific, practical issues for each industry, ANI gave birth to the fourth industrial revolution, which is currently expanding in all sectors: medicine, with systems that assist doctors through automated pathology detection (tumors, glaucoma, fractures…); the automotive industry, with autonomous vehicles; agriculture, with the identification of crop issues and pest detection; media, with recommendation systems for music, video and podcasts; retail, with chatbots; and our daily lives, with assisted governmental policies and smart home assistants. The list goes on.

A collaboration between Humans and Machines

The main technique for teaching a machine in Machine Learning (ML) is called Supervised Learning. Like humans, AI machines learn from examples. Supervised Learning methods teach computers by providing datasets of examples with descriptors, also called “labeled data” or “annotated data”: a combination of inputs and outputs that describes and specifies what is what. A data label may indicate the content of a picture (whether it shows objects, people or infrastructure); it may transcribe the words in an audio recording, along with their context, meaning and overall sentiment; or it may indicate what type of action is being performed in a video. Algorithms written by data scientists then allow the machine to extract the main characteristics and encapsulate this specific knowledge in an overall model.
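The idea of learning from labeled examples can be illustrated with a minimal sketch. This is a toy nearest-centroid classifier in plain Python, used here only as a stand-in for the far larger models trained in practice. The two-number feature vectors, the labels and the function names are all hypothetical.

```python
import math
from collections import defaultdict

def train_centroids(examples):
    """Average the feature vectors of each label (hypothetical toy trainer).

    examples: list of (features, label) pairs, i.e. the labeled data.
    Returns a dict mapping each label to its mean feature vector.
    """
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: tuple(s / counts[label] for s in total)
        for label, total in sums.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    point = tuple(features)
    return min(centroids, key=lambda label: math.dist(centroids[label], point))

# Inputs (features) paired with outputs (labels): made-up values for illustration.
labeled_data = [
    ((1.0, 1.2), "cat"),
    ((0.8, 1.0), "cat"),
    ((4.0, 3.8), "car"),
    ((4.2, 4.1), "car"),
]

model = train_centroids(labeled_data)
print(predict(model, (0.9, 1.1)))  # cat
print(predict(model, (4.1, 4.0)))  # car
```

The model here is nothing more than one average point per label, which is enough to show the role of the labels: without them there would be nothing to average by.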

Unlike humans, machines require myriads of examples to learn from (millions, billions…). Huge amounts of clean and annotated/labeled data are therefore required in order to develop, train and evaluate deep learning algorithms. Data, the raw material, is provided by humans. Human-annotated data is the key to successful machine learning. Data collection, preparation, cleaning and labeling tasks represent over 80% of the time consumed in most AI and Machine Learning projects. Human inputs are required all along the process, from initial training to final validation. Humans are simply better than powerful computers at managing subjectivity, understanding intent and coping with ambiguity.

Millions of people are contributing to this effort. Digital labor platforms and crowd-services companies recruit large teams of smart, well-trained people to perform short and repetitive tasks, known as micro-tasks, such as identifying objects in pictures, annotating videos, text and audio, and spotting errors in mislabeled datasets. These digital workers (also called processors, annotators or Turkers) carry out the data annotation within a crowd-working market characterized by the prevalence of short-term contracts or freelance work, typical of the Gig Economy. Data-labeling companies compete against each other to provide fast, accurate and cost-efficient solutions to their clients. Time, cost and quality are the keys to winning in this human-in-the-loop service, which works globally around the clock to fine-tune large datasets and make algorithms more accurate.

High-quality data is paramount for good machine learning outcomes. The performance of ML and AI models depends on the reliability of the data submitted. Experts in the field are well aware of the notorious GIGO syndrome, “Garbage In, Garbage Out”, where mislabeled data leads to errors and lowers the accuracy of the models.

The data labeling process is hard, tedious and time-consuming. To illustrate this, let’s take an example from the automotive industry. In order to train an autonomous vehicle before it hits the road, ML engineers teach the system with video feeds from real-world conditions. A typical street scene could have 30 cars on the road, and a video generally runs at 30 frames (images) per second. The labelers have to manually label each car on every frame. A very experienced labeler could draw a box around each vehicle at a pace of one per second. For a 10-second sequence, that is 30 cars x 30 frames per second x 10 seconds = 9,000 boxes, or 9,000 seconds of work: labeling 10 seconds of video could take two and a half hours. Knowing that the industry standard requires 10,000 hours of labeled data (with different ontologies such as lane, crossroad, traffic light and pedestrian) to train a vehicle to navigate a city autonomously, the task is phenomenal.
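The arithmetic of this example can be checked with a short sketch. Working through the figures given (30 cars, 30 frames per second, a 10-second clip, one box per second) yields 9,000 boxes, i.e. 9,000 seconds or 2 hours 30 minutes of work. The function and parameter names are hypothetical.

```python
def labeling_time_seconds(objects_per_frame, frames_per_second,
                          clip_seconds, seconds_per_box=1.0):
    """Time needed to draw one box per object on every frame of a clip."""
    boxes = objects_per_frame * frames_per_second * clip_seconds
    return boxes * seconds_per_box

def format_hms(total_seconds):
    """Format a duration in seconds as hours, minutes and seconds."""
    total = int(round(total_seconds))
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"

# Figures from the street-scene example.
seconds = labeling_time_seconds(objects_per_frame=30,
                                frames_per_second=30,
                                clip_seconds=10)
print(format_hms(seconds))  # 2h 30m 00s
```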

Novel techniques are emerging to accelerate data labeling through partial automation. Assisted labeling solutions such as Active Learning use Machine Learning to pre-label datasets along the way and ask the labelers to confirm on the fly that the labels are accurate. This could reduce the required time by 30%.
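A simplified sketch of such an assisted workflow: a model proposes labels with a confidence score, confident proposals go to a quick human confirmation step, and the rest fall back to manual labeling. This is an illustration only, not a full active-learning loop. The threshold, the per-item costs and all names are assumptions chosen for the example, not measured values.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    item_id: str
    label: str
    confidence: float  # model score in [0, 1] (hypothetical)

def route_proposals(proposals, confirm_threshold=0.8):
    """Split model proposals into 'confirm quickly' and 'label manually'."""
    to_confirm, to_label = [], []
    for proposal in proposals:
        if proposal.confidence >= confirm_threshold:
            to_confirm.append(proposal)
        else:
            to_label.append(proposal)
    return to_confirm, to_label

def estimated_seconds(to_confirm, to_label, confirm_cost=0.3, label_cost=1.0):
    """Rough human time, assuming confirming is cheaper than labeling."""
    return len(to_confirm) * confirm_cost + len(to_label) * label_cost

# Made-up proposals for four boxes in one frame.
proposals = [
    Proposal("frame1-box1", "car", 0.95),
    Proposal("frame1-box2", "car", 0.55),
    Proposal("frame1-box3", "pedestrian", 0.88),
    Proposal("frame1-box4", "car", 0.40),
]

to_confirm, to_label = route_proposals(proposals)
assisted = estimated_seconds(to_confirm, to_label)
manual = len(proposals) * 1.0  # every box labeled by hand at 1 s each
print(f"assisted: {assisted:.1f}s, manual: {manual:.1f}s")
```

How much time is actually saved depends on how often the model is confident and how often its confident proposals are right, which is why the costs above are labeled as assumptions.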

The expansion of ML creates a need for more data to be poured into the system, and with it a growing need for human-labeled data. This excess demand might lead to a Big Data labeling crisis in which there are not enough human resources to meet the challenge. At the DataCouncil conference last year, Jennifer Prendki, Founder & CEO at Alectio, stated: “Even if every single human on the planet would just do nothing but data labeling, it will not be enough manpower to handle the task”.

Toward a responsible AI / work & ethics

Responsible initiatives have emerged to establish better practices and to build a more inclusive global supply chain, where fair and ethical conduct is defined not only by respect for national labor laws but also by sustainable development goals. The Global Impact Sourcing Coalition (GISC), composed of buyers and suppliers, is leading the effort. Its members are committed to applying better practices. Together they are developing the Impact Sourcing Standards, with tools and methodologies for more social impact, inclusion and diversity. On its agenda are better salary packages, reasonable working hours, insurance and social security, job stability, increased responsibilities, continuous education and training, family support, accommodation facilities, and support for local charity organizations.

Merging purpose and profit / AI companies & ethics

In India, INFOLKS employs more than 250 people between 23 and 25 years old from underprivileged backgrounds. As stated by their CEO, Mujeeb Kolasseri, their mission is to “Reform our village to a global one by providing economic opportunity to the enthusiastic and dedicated youth from rural area, therefore offering excellent quality of service”.

iMerit’s strategy is centered on its employees. About 80% of its 2,000-strong workforce comes from families with incomes of less than $100 (Rs 7,000) a month; about half of them are women. “We have a social mission to create technology employment among underprivileged communities and in territories where there are fewer companies or industries. We operate in cities slightly lesser known for tech and with less technology employment available,” says Natarajan.

In Europe, Humans in the Loop (HIL) claims to build the next generation of professional humans in the loop from conflict-affected regions and communities. HIL promotes AI which is diverse and bias-free, and supports projects which apply AI for social good. HIL provides job opportunities to displaced people and refugees in conflict-affected countries such as Iraq, Turkey, and Syria, and in Bulgaria, where refugees, asylum-seekers, and migrants handle projects that need an EU/EEA-based workforce. Humans in the Loop provides work for 250 conflict-affected people.

Isahit is a French company founded by Isabelle Mashola and Philippe Coup Jambet in 2017. It is an outsourcing platform for digital micro-tasks. Isahit enables young talents, mainly women from emerging and developing zones (Africa, Asia and South America), to work online part-time as self-employed workers in order to earn additional income (2.84 euros per hour, i.e. up to 284 euros per month for a maximum of 100 hours of work per month), to carry out their concrete short-term professional projects, or to finance the continuation or resumption of their higher education.

The international community of Isahit is composed of more than 1,000 self-employed workers, known as “hiters” (hiteuses in French). It is currently spread over 3 continents and 32 countries. “Our goal for the years ahead is to provide work to 10,000 people in French- and English-speaking Africa while also opening up the platform to people in Asia and Latin America,” summed up Isabelle Mashola, CEO.

Conclusion & Future

1/ Volume

More projects = more needs. The adoption of AI in every industry will imply more projects to be developed and more human resources to fulfill them. A recent study on the Artificial Intelligence software market worldwide 2019–2025 forecasts a 54 percent year-on-year growth rate.

2/ Professionalization

As technology evolves, more specific tasks will be required. Within the next few years, all competitive data preparation tools will have ML-augmented intelligence as a core part of their offering. Over 30% of current labeling tasks will be automated or performed by AI systems by 2024. A larger number of companies will sell pre-trained models or models-as-a-service. However, more complex applications will emerge, and more specific skill sets will be needed to annotate data. The AI industry will need to invest more in education and training for its present workers.

3/ Expertise

AI will evolve from general challenges to narrower applications. As AI tackles deeper problems, more specific knowledge will be required to train and validate the models. A growing share of the workload will become domain-specific, and more specialists from every area of industry will be required to contribute to the next generation of applications. As an example, Natural Language Processing will evolve from translating one language to another to specialized translation in fields such as law or science. Computer Vision will move from detecting cats, cars or people to specialized fields such as biology or astronomy. Demand will shift from general knowledge to expert knowledge.

4/ Adaptive

Applications will need to be retrained to adapt them to the local environment. AI solutions are not universal: their fundamental need for data also implies that local data is essential. As an example, an autonomous vehicle trained to drive in the US will require additional training to adapt to a new environment such as Europe or Africa.

5/ Local

Developing applications to answer local challenges is a great opportunity for emerging countries to benefit from AI.

Africa, for instance, has already demonstrated specific use cases in several domains, such as agriculture (use of drones for more precise agricultural methods, better inspection of individual plants, better monitoring of fields to prevent invasive pests…), finance, with AI-assisted micro-credit approvals, and healthcare, with better, faster and cheaper diagnoses.

AI technology is largely dominated by China, North America and Europe. It will be very difficult for developing countries to compete. However, in the same way that China turned itself into a major provider of new technology after two generations of manufacturing goods for other countries, emerging countries that contribute their workforce to the data labeling industry today could lead the way tomorrow. The real difference will come from how AI technology is applied locally. Emerging countries that invest today in AI to solve local challenges will have the ability tomorrow to invent successful new applications that could spread widely around the world and create the next generation of AI champions.