Synthetic Data’s Role in the Future of AI

As technology improves, AI and ML applications are becoming increasingly pivotal for businesses that want to stay ahead of their competition. Soon, a business that doesn't use AI in its decision-making processes will find itself out in the cold. While AI holds a lot of potential, the technology is still nascent and prone to error.

A big reason for this is the so-called "cold start" problem. ML algorithms learn from historical data, and they get better at predicting future patterns only as more of that data is fed to them.

The challenge here is that most companies lack enough relevant data to feed the algorithms. The data that they do have is either disorganized or isn't of much use in the face of rapidly changing modes of behavior.

Consider this example, courtesy of ML engineer Rico Meinl: “Imagine a new member signs up for Netflix. At this point, the company doesn’t know anything about the new members’ preferences. How does the company keep her engaged by providing great recommendations?”

Netflix has millions of past user interactions to lean on when its algorithms present recommendations to newly onboarded audience members, but what about companies that lack access to dynamic, deep archives?
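To make the cold-start problem concrete, here is a minimal sketch of a toy recommender. It is not how Netflix or any real service works; the interaction log, the titles, and the recommend function are all hypothetical. A member with viewing history gets suggestions based on what similar members watched, while a brand-new member can only be shown whatever is most popular overall.

```python
from collections import Counter

# Hypothetical interaction log (made-up members and titles): member -> titles watched.
HISTORY = {
    "ana": ["Title A", "Title B"],
    "ben": ["Title A", "Title C"],
    "cy": ["Title A"],
}

def recommend(user, history, k=2):
    """Return up to k titles for `user`, falling back to popularity on a cold start."""
    seen = set(history.get(user, []))
    if not seen:
        # Cold start: there is no personal signal, so rank titles by overall popularity.
        popularity = Counter(t for titles in history.values() for t in titles)
        return [title for title, _ in popularity.most_common(k)]
    # Warm start: score unseen titles by how many overlapping members watched them.
    scores = Counter()
    for other, titles in history.items():
        if other != user and seen & set(titles):
            for title in titles:
                if title not in seen:
                    scores[title] += 1
    return [title for title, _ in scores.most_common(k)]

print(recommend("new_member", HISTORY, k=1))  # generic, popularity-based pick
print(recommend("cy", HISTORY))               # picks informed by cy's own history
```

The fallback branch is the whole problem in miniature: with no history, the system has nothing to personalize on, and a company with a thin interaction log has little to fall back on either.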

The conundrum is underscored by the moving target that is consumer behavior in unprecedented times. “We are seeing consumers, on the one hand, shift to trusted A brands,” notes Sajal Kohli, a partner at McKinsey. “On the other hand, there is a lot of pervasive promiscuity because consumers have so much choice as they’ve shifted online that their consideration set has expanded quite dramatically.”

Synthetic data helps companies overcome this hurdle by supplying their ML algorithms with relevant, simulated data. Many companies are already using it to power their next-generation AI algorithms.

Here are some key ways in which synthetic data delivers value in AI application development scenarios.

Greater Data Privacy

Data has become as important as money these days, and consumers are often unwilling to part with it. GDPR and other data privacy laws have ensured that companies cannot play fast and loose with their customers' data.

As a result, AI development has hit major obstacles, since large sets of data cannot be freely used for scenario modeling.

Healthcare is a good example: protecting patient confidentiality matters just as much as developing AI that can detect health issues. Synthetic data offers healthcare AI developers a way to do both.

For example, in one 2018 project from the University of Toronto, simulated X-rays generated from real-world data were used to train ML algorithms. These simulated X-rays aren't the result of masking names and identifying information. They're generated from an amalgam of real-world X-ray data. Healthcare professionals can specify parameters to indicate the presence of diseases.

AI can thus be trained to recognize diseases quickly, without compromising confidential data.
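The difference between masking records and generating new ones can be sketched in a few lines. This is a toy, tabular illustration under stated assumptions, not the method used in the Toronto project: the patient records are made up, and the fit_summary and generate functions are hypothetical. The real rows are reduced to summary statistics, and new rows are sampled from those statistics, with a parameter that specifies whether the condition is present.

```python
import random
import statistics

# Hypothetical "real" records (values are made up): (age, systolic_bp, has_condition).
REAL_RECORDS = [
    (34, 118.0, False), (51, 131.0, False), (47, 125.0, False),
    (62, 148.0, True), (58, 152.0, True), (70, 160.0, True),
]

def fit_summary(records, has_condition):
    """Reduce the matching real rows to a per-column mean and standard deviation."""
    rows = [r for r in records if r[2] == has_condition]
    ages = [r[0] for r in rows]
    pressures = [r[1] for r in rows]
    return {
        "age": (statistics.mean(ages), statistics.stdev(ages)),
        "bp": (statistics.mean(pressures), statistics.stdev(pressures)),
    }

def generate(records, has_condition, n, seed=0):
    """Sample n synthetic rows from the fitted summary instead of copying real rows."""
    rng = random.Random(seed)
    summary = fit_summary(records, has_condition)
    synthetic = []
    for _ in range(n):
        # Assumption: columns are sampled independently, which ignores any
        # correlation between age and blood pressure in the real data.
        age = max(0, round(rng.gauss(*summary["age"])))
        bp = round(rng.gauss(*summary["bp"]), 1)
        synthetic.append((age, bp, has_condition))
    return synthetic

for row in generate(REAL_RECORDS, has_condition=True, n=3):
    print(row)
```

A sketch like this does not by itself guarantee privacy, since summary statistics drawn from very small groups can still leak information, but it shows why generated records are a different thing from real records with the names removed.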

Testing Rogue Scenarios

AI's real test lies in its ability to cope with situations that fall outside the syllabus, so to speak. An algorithm can deal with situations that closely mimic its training data, but it's of no use if the system fails when it confronts the unexpected or, worse, does the opposite of what it should do. Synthetic data helps companies generate a wide range of scenarios that AI systems can learn from.

"You can create synthetic data for everything, for any use case, which brings us to the most important advantage of synthetic data--its ability to provide training data for even the rarest occurrences that by their nature don’t have real coverage," says Dor Herman, CEO and co-founder of synthetic data provider OneView.

Indeed, the ability to generate random scenarios is critical when testing AI effectiveness. A system that doesn't pass tests has to be retrained, and finding real-world data can be painful.

Synthetic data offers a cheap and quick solution that companies can use to accelerate their development programs.
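As a rough sketch of what covering rare occurrences can look like in practice, the generator below lets the developer dial the share of rare cases up or down. Everything here is an assumption made for illustration: the make_scenarios function, the "reading" feature, and the value ranges are hypothetical and not drawn from any provider's product.

```python
import random

def make_scenarios(n, rare_rate, seed=0):
    """Generate n simulated scenarios; each is rare with probability rare_rate."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(n):
        is_rare = rng.random() < rare_rate
        # Hypothetical feature: rare cases draw a reading from a separate, higher range.
        if is_rare:
            reading = rng.uniform(80.0, 120.0)
        else:
            reading = rng.uniform(0.0, 60.0)
        scenarios.append({"reading": reading, "rare": is_rare})
    return scenarios

def rare_count(scenarios):
    """Count how many scenarios are flagged as rare."""
    return sum(1 for s in scenarios if s["rare"])

# A rate close to what might be seen in the field yields very few rare examples,
# while a deliberately inflated rate gives a model far more of them to learn from.
field_like = make_scenarios(10_000, rare_rate=0.001)
boosted = make_scenarios(10_000, rare_rate=0.3)
print(rare_count(field_like), rare_count(boosted))
```

The same knob is useful for retraining: when a system fails a test on an unusual case, more scenarios of that kind can be generated on demand rather than hunted down in real-world logs.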

Prototype Development

As consumers evolve to expect more intelligent solutions, companies are beginning to develop prototypes that automate time-consuming tasks.

A good example of this is Amazon Go, Amazon's cashierless store format. Shoppers pick up the items they want and walk out without going through a checkout. The cost of their items is charged automatically through a digital payment solution.

Developing prototypes of intelligent services like this requires companies to collect a vast amount of data. Small businesses are challenged in this regard because they lack the customer base that companies like Amazon have. Synthetic data offers an elegant solution.

Synthetic data providers generate large datasets based on user-defined parameters and smaller sets of real-world data. As a result, small companies can develop and test their prototypes in cost-effective ways.
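One simple way to picture this is resampling with variation: take a handful of real records, draw from them repeatedly, and perturb each draw within limits the user sets. The sketch below assumes a made-up set of shopping baskets and a hypothetical expand function with a price_jitter parameter; commercial providers use far more sophisticated generators.

```python
import random

# Hypothetical seed data (values are made up): (items_in_basket, basket_total).
SEED_BASKETS = [(3, 12.50), (1, 4.20), (7, 38.90), (2, 9.75), (5, 22.10)]

def expand(seed_rows, n, price_jitter=0.1, seed=0):
    """Create n synthetic baskets by resampling seed rows and perturbing them.

    price_jitter is a user-defined parameter: totals vary by up to +/- this fraction.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        items, total = rng.choice(seed_rows)
        # Vary the item count by at most one, never dropping below a single item.
        items = max(1, items + rng.randint(-1, 1))
        # Scale the total by a random factor inside the jitter band.
        total = round(total * (1 + rng.uniform(-price_jitter, price_jitter)), 2)
        synthetic.append((items, total))
    return synthetic

baskets = expand(SEED_BASKETS, n=1000, price_jitter=0.1)
print(len(baskets), baskets[:3])
```

Five real baskets become a thousand plausible ones, which is enough to exercise a prototype, though the output can only ever be as representative as the small real sample it started from.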

San Francisco-based startup Standard Cognition is developing an AI-powered checkout system similar to Amazon Go for brick-and-mortar retailers to use as a service. The result is a cost-effective checkout solution that allows brick-and-mortar retailers to circumvent Amazon and the data sharing that comes with using Amazon.

As Standard co-founder Jordan Fisher explains, "Amazon’s technology is very expensive. Standard Cognition is essentially a retrofit, so it has to be cheap and flexible enough to easily deploy in an existing store. With Amazon, everything is custom, down to the shelves. That costs millions, and the end result is turning your store into an Amazon Go store in everything but name."

Building Flexibility in Modeling Processes

Real-world data is rigid, and its use is strictly regulated. Aside from data privacy issues, companies have to worry about copyright infringement as well.

Synthetic data helps them sidestep all of these issues and generate as many scenarios as possible without fear of overstepping their boundaries.

As a result, companies can create more robust testing processes at a lower cost than finding and cleaning real-world data. Their products can reach the market faster because the underlying models have been trained on a wide variety of scenarios during the development stage.

Artificial but Impactful

On the surface of it, synthetic data might be dismissed as having limited usefulness. However, thanks to advanced data generation techniques, these data can replicate real-world scenarios with high levels of accuracy.

AI represents a massive leap forward for businesses everywhere, and synthetic data allows companies of all sizes to compete.