Evolution of the Data Production Paradigm in AI

Written by toloka | Published 2021/09/17

TLDR: Training data is the last bastion that defines the unique features of your AI-driven product. Product management in ML and AI today is essentially all about data management. Data labeling is required at every stage: first when you collect the training set, then when you validate the quality of the model trained on it, and finally to control how the model actually works in real life after production deployment. For a mature AI product, it is essential to set up a process of continuous improvement, where regular data updating plays a key role.

Today, the success of AI development relies on three key aspects:

🔷 Algorithms, which define the logic of the model.

🔷 Hardware, which is used to run the algorithms.

🔷 Data, which the algorithms learn from when training the model.

While the first two have already become commodities (anyone can use existing open-source libraries and cloud services to run standard algorithms), training data is still a bottleneck in the industry. What’s more, with the other two pillars equally accessible to any player on the market, training data is the last bastion that ultimately defines the unique features of your AI-driven product.

One can argue that whoever manages training data production most effectively wins the long-term business competition.

Historically, many AI solutions deployed online relied on algorithms trained on user behavior logs, clicks, and other automatically collected data. But the more AI expands its area of application, and the further it moves into the offline world, the more often the only way to obtain training data is to have it labeled by humans. So the question of effectively managing training data production is really a question of effectively managing data labeling processes.

Data labeling as an integral part of the AI production pipeline

For an MVP of an AI product, it’s fine and reasonable to train the first model on a data set that was collected quickly and with minimal effort. But for a mature AI product, it is essential to set up the process of continuous improvement, where regular data updating plays a key role.

When we look at the AI production pipeline, we see that data labeling is required at every stage: first, when you collect the training set; then when you validate the quality of the model you trained on it; and then to control how the model actually works in real life after production deployment (this is the stage that many start-ups forget about!).

Here are a few direct and reliable strategies for improving the quality of an AI product by improving its data:

🔷 Collect more data in the training set (more data generally beats less).

🔷 Regularly update the training set with up-to-date data (the majority of AI applications are subject to context drift, and you can’t build a good AI product in 2021 that’s trained on data collected in, say, 2009).

🔷 Validate the quality of your model and deploy updated models to production faster (those who update their product more frequently develop at a higher pace).

🔷 Regularly control the quality of your AI solution in production.

This means that the long-term success of an AI-based product relies on having the infrastructure for scalable, flexible, and cost-effective data labeling.

Core expertise in modern AI production

While on Kaggle you compete on a fixed, given data set, in the real world businesses compete on complete production pipelines.

You may have the best engineers and the most incredible computing power, but your model’s success can never exceed the quality of the data it was trained on. As a result, knowing how to build complex data labeling pipelines becomes a prerequisite for success. This is exactly why product management in ML and AI today is essentially all about data management. And since data management rests on data labeling, this is something that you as a company, just like the ML algorithms themselves, wouldn’t want to surrender to a third party.

Data labeling production

At Toloka, we aim to provide businesses with an all-inclusive environment for data labeling. It may not seem obvious, but such environments are far from trivial for any individual company to build. To provide a scalable, on-demand workforce that produces high-quality labels 24/7 in all major languages, two things are necessary:

🔷 A global crowdforce that’s ready to take on tasks immediately after they are posted. This is only possible in an open environment where many different requesters create demand for many different performers, and this steady demand motivates performers to stay on the platform.

🔷 Methods and tools for automated quality management at scale (one common building block is sketched below).
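A widespread example of such a method is control tasks (also called golden sets or honeypots): tasks with known answers mixed into the regular stream, used to estimate each performer’s accuracy and filter out those who fall below a threshold. Here is a minimal sketch; the function names and the 0.7 accuracy bar are illustrative assumptions, not a specific platform’s API:

```python
# A minimal sketch of quality control via control tasks ("honeypots"):
# tasks with known answers are mixed into the stream, and a performer's
# accuracy on them decides whether their work is trusted.
# Names and the 0.7 threshold are illustrative assumptions.

def performer_accuracy(answers: dict, gold: dict):
    """answers: the performer's answer per task id; gold: known answers."""
    checked = [tid for tid in answers if tid in gold]
    if not checked:
        return None  # this performer hasn't seen a control task yet
    correct = sum(answers[tid] == gold[tid] for tid in checked)
    return correct / len(checked)

def should_suspend(answers: dict, gold: dict, min_accuracy: float = 0.7):
    """Flag performers whose accuracy on control tasks is too low."""
    accuracy = performer_accuracy(answers, gold)
    return accuracy is not None and accuracy < min_accuracy

# Example: 2 of 3 control tasks correct -> ~0.67, below the 0.7 bar.
print(should_suspend({"t1": "cat", "t2": "dog", "t3": "cat"},
                     {"t1": "cat", "t2": "dog", "t3": "dog"}))  # True
```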

People management as an engineering task

One of the key questions is how to set up a process that effectively uses the efforts of millions of independent performers and delivers stable, scalable, high-quality outcomes that are resistant to the mistakes of individual performers.

Toloka believes that the key to this challenge is a combination of maths, engineering, and effective business processes. It is the process that needs to be managed, not the people who perform the tasks.

The good news is that with the right decomposition of these processes, appropriate pre-training of crowd workers, and proper aggregation of results, you can achieve high levels of data labeling quality that can be reproduced at a larger scale. The key ingredient is automation: whatever can be automated should be automated.
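To make “proper aggregation of results” concrete, here is a minimal sketch of the simplest common scheme: majority voting over overlapping labels, with the share of agreeing votes as a crude confidence score. The names are illustrative, and production systems typically rely on richer statistical models such as Dawid-Skene.

```python
# A minimal sketch of result aggregation by majority vote. Each item is
# labeled by several independent performers (the "overlap"); the most
# frequent label wins, and vote share serves as a crude confidence.
from collections import Counter

def aggregate_majority(labels_by_item: dict) -> dict:
    """labels_by_item maps an item id to the list of labels it received."""
    results = {}
    for item_id, labels in labels_by_item.items():
        label, votes = Counter(labels).most_common(1)[0]  # ties: arbitrary
        results[item_id] = (label, votes / len(labels))
    return results

# Example: three performers labeled each query-document pair.
raw = {
    "pair-1": ["relevant", "relevant", "irrelevant"],
    "pair-2": ["irrelevant", "irrelevant", "irrelevant"],
}
print(aggregate_majority(raw))
# pair-1 -> ("relevant", ~0.67); pair-2 -> ("irrelevant", 1.0)
```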

As an example, let’s take a look at a search relevance data labeling pipeline.

Search relevance evaluation is arguably one of the most sophisticated tasks in applied machine learning and data labeling: the variety of user queries, and of documents found for those queries, is practically infinite. This makes it all the more important to assign tasks to the labelers who already have expertise in a given domain and can evaluate content correctly. The pipeline contains multiple levels of evaluation and verification: judgments are explained, and the explanations are verified; the confidence of results is estimated automatically, and low-confidence results are re-evaluated.
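That last step lends itself to a simple control loop. Below is a rough sketch of confidence-based re-evaluation: items whose aggregated confidence falls below a threshold are routed back for additional judgments until an overlap cap is reached. The thresholds, field names, and the request_more_judgments callback are illustrative assumptions, not Toloka’s actual implementation.

```python
# A rough sketch of confidence-based re-evaluation. Items with low
# aggregated confidence are sent back for extra judgments; an overlap
# cap keeps the loop from running forever on genuinely ambiguous items.
CONFIDENCE_THRESHOLD = 0.8  # minimum acceptable aggregated confidence
MAX_OVERLAP = 9             # stop requesting judgments beyond this

def needs_reevaluation(item: dict) -> bool:
    """item carries an aggregated 'confidence' and the number of
    judgments collected so far ('overlap')."""
    return (item["confidence"] < CONFIDENCE_THRESHOLD
            and item["overlap"] < MAX_OVERLAP)

def route(items, request_more_judgments):
    """Split items into finished ones and ones sent back for labeling."""
    finished, pending = [], []
    for item in items:
        if needs_reevaluation(item):
            request_more_judgments(item["id"])  # e.g. raise the overlap
            pending.append(item)
        else:
            finished.append(item)
    return finished, pending
```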

The trick is that once such a pipeline has been designed, it can function and scale automatically for as long as needed: money goes in as input, and labeled results come out as output. Just like that, the task of managing the efforts of thousands of performers turns from a people management task into a purely mathematical and engineering challenge.

Since Toloka provides data labeling infrastructure for thousands of clients, we can observe how data labeling needs differ between the stages of a business’s development:

🔷 “Mature” AI teams with many experienced engineers seek ultimate flexibility and opportunities to deeply integrate data labeling into their ML production pipelines. For such clients, Toloka offers a wide range of settings and tools via its interface, API, and Python client (see the sketch after this list).

🔷 Clients who are still in the early stages of their AI production seek fast results. They don’t want to invest a lot of effort into training their first models, so they go for Apps – pipelines for standard use cases where everything that needs to be thought through (decomposition, performer training, quality control, and mathematical aggregation of results) is already pre-set to deliver the best quality with minimal effort.
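For the first group, integration typically happens in code. Here is a minimal sketch using the open-source toloka-kit Python client: the API token and pool ID are placeholders, and a real pipeline would add error handling, aggregation, and a retraining step downstream.

```python
# A minimal sketch of pulling labeled results with the toloka-kit
# Python client. The token and pool ID are placeholders.
import toloka.client as toloka

client = toloka.TolokaClient('YOUR_API_TOKEN', 'PRODUCTION')

# Reopen an existing labeling pool, e.g. as part of a scheduled
# retraining job that tops up the training set with fresh data.
client.open_pool('YOUR_POOL_ID')

# Stream accepted assignments and pair each task with its label.
for assignment in client.get_assignments(pool_id='YOUR_POOL_ID',
                                         status='ACCEPTED'):
    for task, solution in zip(assignment.tasks, assignment.solutions):
        print(task.input_values, solution.output_values)
```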

Support training data production at every stage

A business’s data labeling requirements change over time. In the beginning, the only things that matter are quality, speed, and minimal effort. But for mature, well-established AI products, scalability and flexibility take center stage.

Data labeling lies at the core of AI production. As with any core part of business, it should not be outsourced: optimizations in the process of data labeling directly affect the efficiency of the whole business.

Toloka provides data labeling infrastructure built on a global crowd of millions of performers, tools for automated quality management, and pre-set solutions for the most popular use cases.

Olga Megorskaya, CEO of Toloka.

