Are You Poisoning Your Data? Why You Should Be Aware of Data Poisoning

Machine learning is a remarkable technology. It promises to disrupt business analytics, customer service, hiring, and more. But all this potential comes with some equally weighty concerns. One of them is data poisoning, which could render these algorithms useless or even harmful.

Governance and security challenges are already the leading obstacles in machine learning deployment today. Data poisoning could introduce new risks, making these challenges even more prevalent and potentially harming machine learning’s adoption.

Businesses need to understand these risks to use this technology safely and effectively. With that in mind, here’s a closer look at data poisoning and what companies can do to prevent it.

What Is Data Poisoning?

Despite its massive potential for disruption, data poisoning is fairly straightforward. It involves injecting misleading or erroneous information into machine learning training data sets. When the model learns from this poisoned data, it’ll produce inaccurate results.
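To make the mechanism concrete, here is a minimal sketch of one common form of poisoning: injecting mislabeled records. It assumes a toy nearest-centroid classifier and a single made-up numeric feature (a hypothetical "spam score"); the data, labels, and function names are illustrative only and do not come from any real system.

```python
from statistics import mean


def train_centroids(samples):
    """Learn one centroid (mean feature value) per label."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Return the label whose centroid lies closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


# Hypothetical clean training set: (spam score, label).
clean = [
    (1.0, "ham"), (2.0, "ham"), (3.0, "ham"),
    (8.0, "spam"), (9.0, "spam"), (10.0, "spam"),
]

# Injected records: spam-like scores deliberately labeled as ham.
poison = [(9.0, "ham"), (10.0, "ham"), (11.0, "ham"), (12.0, "ham")]

clean_model = train_centroids(clean)
poisoned_model = train_centroids(clean + poison)

suspicious_message_score = 7.0
print(predict(clean_model, suspicious_message_score))     # trained on clean data
print(predict(poisoned_model, suspicious_message_score))  # trained on poisoned data
```

The injected records drag the "ham" centroid toward spam-like scores, so a message the clean model would flag can slip past the poisoned one. Real attacks and real models are far more complex, but the principle is the same.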

How destructive these attacks are depends on the model and the poisoned data in question. They could be as tame as making an NLP (natural language processing) algorithm fail to recognize some misspelled words. Alternatively, they could jeopardize people’s careers.

Amazon’s infamous recruiting algorithm project hints at what a data poisoning attack could do. The company scrapped the project after realizing the model trained itself to prefer men due to learning from mostly men’s resumes. A cybercriminal could poison a similar algorithm’s training data to create the same result.

Data poisoning can happen with any machine learning model, black box or white box, supervised or unsupervised. While the approaches may vary in each scenario, the overall goal remains the same: inject or alter data in training data sets to compromise algorithm integrity.

Data Poisoning Examples

This threat is more than theoretical. Organizations have already experienced data poisoning attacks, and as machine learning gains more prominence, these attacks may become more common.

Data poisoning attacks date back as far as 2004, when attackers compromised email spam filters. Cybercriminals would inject new data to make these fairly simple algorithms fail to recognize some messages as spam. The criminals could then send malicious messages that flew under these defenses’ radar.

One of the most famous examples of data poisoning occurred in 2016 with Microsoft’s Tay chatbot. The bot, which learned from how people interacted with it, quickly began using inappropriate words and images after users intentionally did the same with it. Tay, by design, taught itself that this was how people spoke, so it adopted the offensive language.

Why Should Data Professionals Be Concerned?

While subpar spam filters and rude chatbots aren’t ideal, they may not seem particularly threatening. However, data poisoning can be far more dangerous, especially as businesses rely more heavily on machine learning.

In 2019, researchers showed how they could poison a street sign identifier to recognize stop signs as speed limit signs. If such an attack targeted self-driving cars, it could cause them to run through stop signs, endangering passengers and other drivers. Similar attacks could cause these machines to fail to see pedestrians, causing collisions and threatening people’s lives.

Data poisoning attacks also become more concerning as automation becomes more common in cybersecurity. Just as early attacks hindered spam filters’ efficacy, new ones could affect intrusion detection models or other security algorithms. Attackers could cause monitoring software to fail to recognize irregular behavior, opening the door to more destructive attacks.

With 32% of organizations altering their long-term strategy in response to data analytics, poisoning can have considerable business implications. Cybercriminals could create misleading analytics models that lead organizations to embrace poor practices, potentially resulting in lost profits and business.

How to Protect Against Data Poisoning

Given how damaging data poisoning can be, data professionals must defend against it. The key to these defenses is preventing unauthorized access to training data sets and detecting it when it does occur.

Businesses must consider their training data sources carefully, reviewing even data from trusted sources before using it. If organizations must move their training data, they should check it before and after to ensure no poisoning occurred in transit. Phased data migration is also ideal, as it creates no downtime, minimizing attackers’ opportunities to inject malicious or misleading data.
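One simple way to check data before and after a move is to compare cryptographic checksums. The sketch below assumes the training data lives in a single file and uses SHA-256 from Python's standard library; the function names are hypothetical, and a real pipeline would likely hash many files and store the digests somewhere the data mover cannot modify.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data sets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_digest, destination_path):
    """Return True only if the moved file matches the digest taken at the source."""
    return sha256_of_file(destination_path) == source_digest
```

A team would record `sha256_of_file(...)` at the source before the move, then call `verify_transfer(...)` on the copy at the destination. A matching digest shows only that the file did not change in transit. It says nothing about whether the data was clean to begin with, so it complements source review rather than replacing it.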

Machine learning developers should restrict access to their training data as much as possible and implement strong identification controls. Since more than 80% of hacking incidents use lost or stolen credentials, multi-factor authentication is critical. If teams use on-premises data centers, they must also restrict physical access to server rooms through keycards and security cameras.

Frequent audits can reveal changes to data sets, indicating a poisoning attack. Data professionals should also understand their role in how these models learn, being careful to prevent their own biases from seeping in and unintentionally poisoning the data.
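As a rough illustration of what such an audit might look like, the sketch below fingerprints each training record and compares a trusted baseline against the current data set. It assumes records are JSON-serializable dictionaries with a unique `id` field; that field name and the helper functions are hypothetical, not part of any particular tool.

```python
import hashlib
import json


def fingerprint(record):
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(records, key_field="id"):
    """Map each record's key to its fingerprint."""
    return {record[key_field]: fingerprint(record) for record in records}


def audit(baseline, current):
    """Report which record keys were added, removed, or altered."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "altered": sorted(key for key in shared if baseline[key] != current[key]),
    }


# Hypothetical snapshots of a labeled data set.
trusted = [
    {"id": 1, "text": "win a prize", "label": "spam"},
    {"id": 2, "text": "meeting at noon", "label": "ham"},
    {"id": 3, "text": "cheap pills", "label": "spam"},
]
today = [
    {"id": 1, "text": "win a prize", "label": "spam"},
    {"id": 2, "text": "meeting at noon", "label": "spam"},  # label changed
    {"id": 4, "text": "free money", "label": "ham"},        # new record
]

report = audit(build_manifest(trusted), build_manifest(today))
print(report)
```

Not every change is an attack, since data sets legitimately grow and get corrected. The value of the report is that each change becomes visible and can be traced back to someone authorized to make it.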

Reliance on Data Raises New Risks

As companies rely more on data and data-centric technologies, data vulnerabilities become more concerning. Machine learning can yield impressive results, but businesses must be careful to make sure no attackers lead their algorithms astray.

Data poisoning can render an otherwise game-changing model useless or even harmful. Data professionals must understand this threat to develop machine learning models safely and effectively.