Adversarial Examples and their implications - Deep Learning bits #3

Written by juliendespois | Published 2017/08/09


Featured: Adversarial examples, future of deep learning, security and attacks

In the “Deep Learning bits” series, we will not see how to use deep learning to solve complex problems, as we do in the A.I. Odyssey series. Instead, we will look at different techniques and concepts related to deep learning.

Introduction

In this article, we are going to talk about adversarial examples and discuss their implications for deep learning and security. They must not be confused with adversarial training, which is a framework for training neural networks, as used in Generative Adversarial Networks.

What are Adversarial Examples?

Adversarial examples are handcrafted inputs that cause a neural network to predict a wrong class with high confidence.

Usually, neural network errors occur when the image* is either of poor quality (bad cropping, ambiguous, etc.) or contains multiple classes (a car in the background, etc.). This is not the case for adversarial examples, which look like ordinary images.

*In this post, we will focus on images as they provide interesting visual support, but keep in mind that this can be applied to other inputs such as sound.

While the first two mistakes are understandable, the third image definitely looks like a temple, and we would expect any properly trained neural network to make the correct prediction.

What’s going on here then?

The specificity of adversarial examples is that they do not occur in natural data: they are crafted. An adversarial attack against a neural network is a process in which someone slightly modifies an image so that it fools the network. The goal is to minimize the perturbation to the original image while obtaining a high confidence for the target class.

Creation of an adversarial example to target the Ostrich class

How is this done?

The generation of adversarial examples is a vast topic, and new techniques are being discovered to create faster, more robust perturbations with minimal image distortion.

We will not dwell on how these are generated; we will rather focus on their implications. A simple and general method, though, is to take the original image, run it through the neural network, and use the backpropagation algorithm to find out how the pixels of the image should be modified to increase the confidence of the target class.
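To make this more concrete, here is a minimal sketch of that idea (the model, image tensor and hyperparameters below are purely illustrative, and the post itself comes with no code). It repeatedly backpropagates the loss of a chosen target class and nudges the pixels in the direction that increases the network’s confidence in that class, a targeted cousin of the fast gradient sign method:

```python
import torch
import torch.nn.functional as F

def targeted_attack(model, image, target_class, epsilon=0.01, steps=10):
    """Iteratively nudge `image` towards `target_class`.

    Illustrative sketch only: real attacks (FGSM, PGD, Carlini & Wagner, ...)
    add explicit constraints on how large the perturbation may become.
    """
    model.eval()
    adv = image.clone().detach().requires_grad_(True)
    target = torch.tensor([target_class])

    for _ in range(steps):
        logits = model(adv.unsqueeze(0))           # forward pass
        loss = F.cross_entropy(logits, target)     # loss w.r.t. the *target* class
        loss.backward()                            # backprop down to the pixels

        with torch.no_grad():
            # Step against the gradient to lower the target-class loss,
            # i.e. to raise the target-class confidence.
            adv -= epsilon * adv.grad.sign()
            adv.clamp_(0, 1)                       # keep pixels in a valid range
        adv.grad.zero_()

    return adv.detach()
```

On a trained classifier, calling `targeted_attack(model, image, target_class=some_index)` would return an image that still looks unchanged to us, but that the network now pushes towards the chosen class, exactly the kind of perturbation shown in the ostrich example above.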

Is this really a big deal?

The first thing that we usually think of when we see adversarial examples is that they are unacceptable. As humans would classify them correctly without breaking a sweat, we intuitively expect any good model to do so. This reveals our intrinsic expectations for a neural network: we want human or super-human performance.

“If a model fails to classify this as a Temple, then it’s necessarily bad” — Or is it?

Let’s step back for a minute and think about what this means.

On a given task (e.g. identifying road signs in a self-driving vehicle), we wouldn’t replace the human with a computer unless it is at least as good as the human.

Something we often forget is that having a model that is better than a human does not imply any requirement on the failure cases. In the end, if the human has an accuracy of 96% and the neural network of 98%, does it really matter that the examples the machine missed are considered easy?

The answer to this question is yes… aaaand no.

Even though it’s frustrating and counter-intuitive to see state-of-the-art models fail on what look like trivial examples, this doesn’t represent a fundamental issue. What we care about is how powerful and reliable the model is. We have to accept that our brain and deep learning do not work in the same way and therefore don’t yield the same results.

“Do we care whether the examples the machine missed are never missed by a human?”

What does matter, though, is that adversarial attacks represent a security threat to AI-based systems.

How can we maliciously exploit adversarial examples?

Many kinds of deep-learning-powered systems could severely suffer from adversarial attacks if someone got their hands on the underlying model. Here are some examples.

  • Upload images that bypass safety filters
  • Create bots that don’t get flagged by Google’s “I’m not a robot” system

That’s for the virtual world. Implementing such attacks on real-life objects is significantly harder because of all the transformations involved when taking a picture of an object, but it is still possible.

Robust Adversarial Example in the wild by OpenAI — The red bar indicates the most probable class for the image. Here the cat is classified as a desktop computer

With this in mind, you could imagine:

  • Stealing someone’s identity by wearing special glasses
  • Misleading a self-driving car by altering traffic signs
  • Disguising a weapon to avoid video detection
  • Bypassing audio or fingerprint identification

Impersonating Milla Jovovich with custom glasses

What can we do against that?

There are a few things we can do to mitigate this issue. One option is to keep the model private. However, this solution has two big flaws.

First, secure systems should ideally be built following Kerckhoffs’s principle, as restated by Shannon: “one ought to design systems under the assumption that the enemy will immediately gain full familiarity with them”. This means we shouldn’t rely on the secrecy of the model, because one day or another it will be leaked*.

Second, some papers have been published on universal/model-independent adversarial attacks, which could potentially work for a given task no matter which model is used.

*Note: that is the same reason why sensitive databases are encrypted. When you think about it, there would be no “need” to encrypt your database if you were certain it would never be hacked.

On the bright side, techniques like Parseval networks or defensive distillation are being developed to make neural networks more resilient to adversarial attacks. We can also train the model on both normal images and adversarial examples to help the network learn to disregard the perturbations.
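As a rough sketch of that last idea, adversarial training boils down to generating perturbed versions of each batch and mixing them into the loss. The FGSM-style perturbation and the 50/50 weighting below are assumptions for illustration, not a specific paper’s recipe:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=0.03):
    """Untargeted FGSM-style perturbation: push pixels up the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return (images + epsilon * images.grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, images, labels, adv_weight=0.5):
    """One optimisation step on a mix of clean and adversarial examples."""
    adv_images = fgsm_perturb(model, images, labels)

    optimizer.zero_grad()                    # discard gradients from the attack
    clean_loss = F.cross_entropy(model(images), labels)
    adv_loss = F.cross_entropy(model(adv_images), labels)
    loss = (1 - adv_weight) * clean_loss + adv_weight * adv_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

Seeing its own adversarial examples at training time encourages the network to treat the perturbation as noise rather than as evidence for another class.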

Extrapolation of OpenAI’s remark on transformations and perturbation magnitude

Also, a team from OpenAI noticed that it becomes increasingly hard to find small perturbations when you want to make an adversarial attack robust to many transformations (rotation, perspective, etc.). We can imagine that some models could reach a point where no perturbation is both resilient to all transformations and undetectable. We could then flag adversarial examples as anomalies and therefore be safe from adversarial attacks for this task, although this might be tricky to implement (it could require multiple cameras, for example).
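One way to picture such an anomaly check (a hypothetical sketch, not something taken from the OpenAI work) is to compare the model’s prediction on the raw input with its predictions under a few random transformations, and to flag the input when they disagree too often:

```python
import torch
import torchvision.transforms as T

def looks_adversarial(model, image, n_views=8, agreement_threshold=0.75):
    """Flag an input whose prediction is unstable under random transformations.

    A perturbation tuned for one exact framing of the image tends to lose its
    effect after small rotations or crops, so heavy disagreement between views
    is suspicious. The transformations and thresholds here are illustrative.
    """
    model.eval()
    augment = T.Compose([
        T.RandomRotation(degrees=15),
        T.RandomResizedCrop(size=image.shape[-2:], scale=(0.8, 1.0)),
    ])

    with torch.no_grad():
        base_pred = model(image.unsqueeze(0)).argmax(dim=1).item()
        agreements = sum(
            int(model(augment(image).unsqueeze(0)).argmax(dim=1).item() == base_pred)
            for _ in range(n_views)
        )

    return (agreements / n_views) < agreement_threshold
```

As the OpenAI result suggests, a perturbation strong enough to survive all of these views would also have to be large, and therefore easier to notice.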

Conclusion

To conclude, adversarial examples are an incredibly interesting area of deep learning research, and progress is being made every day towards secure deep learning systems. It’s paramount that research teams play both cop and robber, trying to break neural network classifiers as well as to defend them.

As of today, adversarial attacks are beginning to represent a threat to deep-learning-based systems. However, few systems rely blindly on neural networks for critical verifications, and adversarial attacks are not yet versatile or robust enough to be applied at large scale by anyone other than research teams.

In the years to come, more and more attacks against neural networks will become possible, with increasingly attractive rewards for the attackers. Hopefully, we will have developed strong defense techniques to stay safe from such attacks by then.

Thank you for reading this post! Feel free to share it and follow me if you like AI-related stuff!

