If the Training Data Sucks, So Does the AI Itself

A man compromises with insomnia and rolls out of bed at 4:30 AM. The sun is hours away from making its daily debut, but it matters not for this man. There’s no need to shave. He hasn’t in four days. He immediately lights a cigarette, a hand-rolled cigarette of unknown (to you) origin. He flicks on the radio. Immediately turns it off. This moment deserves silence. Stares into the mirror. Naked. Buck naked. Looks into himself. Deep into himself. Puts the cigarette out on the back of his hand and flicks it into the toilet. Finally, the words rattling around in his brain slither their way past his lips in an exasperated murmur: “Our training data f*cking sucks.”

And it’s expensive too!

Look, everyone and their grandmother knows AI is huge. Your grandmother probably talks to Snapchat AI more than she talks to you. Either way, while AI certainly provides an entertainment factor, more than anything it can be downright useful. And businesses are adopting AI initiatives at an unprecedented pace. I know the world doesn’t need another blog about the growth of AI, but I’ll mix it up in a second.

First, get this: In 1923, only 0% of businesses considered artificial intelligence to be of high priority to their organization. Wow. By 2020, 54% of surveyed IT professionals were highly prioritizing AI. By the end of 2022, that number had climbed to 69% (nice), a 15-percentage-point increase in just two years.

But close to half (47%) of AI/ML users began their initiatives in the past two years, and 78% of those surveyed had moved past the ideation stage into execution. What does this mean? Statistically speaking, there are a lot of businesses out there running AI programs and initiatives that are total newbies to the field and likely have no idea what they are doing. Which percentage of the 47% are that old dog chemist meme? Well, I can’t answer that for you. What I can tell you is that the biggest reported challenge in companies’ AI/ML journeys is a shortage of skilled talent (67%), followed by algorithm and model failure (61%). When it comes to adopting AI, the barrier reported most is the cost of implementation. And what takes up the biggest chunk of AI budgets? Sourcing and implementing training data, checking in at 13% of budgets.

A lot of data is just flat-out bad. It’s unreliable, it’s difficult to manage, and it’s entirely possible the AI is trained on laundered data, meaning the data used to train the model is sourced from another AI model that was already trained on sketchy data. Shout out to Olga Mack for the intro to this terminology.

So data is bad, it’s expensive, it could be the equivalent of a t-shirt with typos purchased from a thrift store (shout out to my friend’s Nomar “Garciapara” Red Sox shirt), and a gigantic swath of businesses implementing AI are new and lack the resources and talent to make things work, let alone keep them sustainable.

To this end, a whopping 87% of executives are willing to pay more for higher-quality training data, while 66% predict their need for training data will only increase, compared to 0% of them predicting it will decrease. This is a 0% increase from my make-believe 1923 survey.

More numbers, you say? More numbers you’ll receive. In 2022, global spend on artificial intelligence was around $118 billion. By 2026, the number is expected to reach $300 billion. 13% of $300 billion is…$39 billion. Now I know this isn’t exactly how statistics work, so don’t grill me. But in short: global spend on training data for AI is a multi-billion dollar industry. Factor in that 66% of these execs expect the need for training data to increase and 87% are willing to spend more for higher-quality data, and…well, you get the point.
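For anyone who wants to poke at the napkin math, here is a minimal Python sketch of it. The function and variable names are made up for illustration, and it leans on the same shaky assumption as the paragraph above: that the 13% budget share carries over unchanged to the projected 2026 total.

```python
def training_data_spend(total_ai_spend_billions: float, budget_share: float) -> float:
    """Rough training-data spend in billions of dollars.

    Assumes the budget share applies uniformly to total spend,
    which is a simplification, not a forecast.
    """
    return total_ai_spend_billions * budget_share


projected_2026_spend = 300.0  # $ billions, the projection quoted above
training_data_share = 0.13    # share of AI budgets quoted above

estimate = training_data_spend(projected_2026_spend, training_data_share)
print(f"~${estimate:.0f} billion")
```

Swap in your own spend figure or budget share to see how quickly the number moves.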

More factors

On top of this, sourcing reliable data is way more difficult in 2023 than it was in the past. Privacy initiatives like GDPR and CCPA aim to protect consumer data. Major tech players like Google and Apple are making third-party data collection increasingly difficult. Ongoing legal battles have AI training data at the forefront, with a popular sentiment being that scraping web data to train AI and claiming it as “fair use” is in jeopardy of becoming a thing of the past. An apt comparison may be the Napster fallout of the early 2000s. It was clearly evident back then that Napster was powered by the illegal sharing of copyrighted material and intellectual property, and a similar trajectory is something businesses using AI are forced to consider. The sand may filter down through the hourglass, and Metallica’s “For Whom the Bell Tolls” is likely to play for those who haven’t put in the effort to futureproof their AI initiatives.

A new Spotify

So, what is the solution? Well, it’s complicated. But out of the ashes of Napster, Kazaa and Limewire came Spotify, which operated on the premise of building something “better than piracy.” This involved hashing out deals with record labels and agencies to properly license the content streamed on Spotify platforms. Is the same thing possible for AI? We think so. 85% of consumers will exchange data for coupons or discounts. This paves the way for a data acquisition model that incentivizes users to participate, generating valuable zero-party data which can be used for a multitude of things, including training AI. We built something to license zero-party data, and even built a feature in partnership with Snowflake to allow businesses to re-list licensed zero-party data. Based on the desire for higher-quality training data, this could prove to be a gigantic opportunity for an additional revenue stream that can also build customer loyalty. But enough brand-y stuff. You can learn more here.

In summary…

A lot of training data f*cking sucks. I haven’t dug up the correlation between training data sucking and sales for Gillette razors, but I would imagine there’s something there. On top of it sucking, it’s expensive. More and more companies are dedicating time and resources to implementing AI, but many of them are new to the game and lack the proper team, infrastructure, and quality data to optimize their initiatives. Legal battles have thrown a wrench into the “old ways” of AI training data sourcing and collection, and privacy initiatives have made it increasingly difficult for businesses to collect the data needed to fuel their business. Looking to companies like Spotify for inspiration, we know it’s possible to overcome the legal hurdles. Given consumer sentiment on data sharing, coupled with a desire for more personalization and customization in their brand experiences, we’ve recognized a giant market for licensing zero-party data for re-sale (among many other use cases). Hey, what’s 13% of $300 billion again?

Written by Shane Faria, co-founder @TIKI