by Nicholas Carlini 2020-02-20
I have---with Florian Tramer, Wieland Brendel, and Aleksander Madry---spent the last two months breaking thirteen more defenses to adversarial examples. We have a new paper out as a result of these attacks. I want to give some context as to why we wrote this paper here, on top of just “someone was wrong on the internet”. (Although I won't deny that's a pretty big part of the reason.)
The story begins at least two years ago when Anish Athalye, David Wagner, and I wrote a paper studying the adversarial example defenses that appeared at ICLR 2018. (Before that I had performed a number of other attacks, but the field was rather immature at that point; I won't complain about those defenses.) We found that, unfortunately, the vast majority of defenses didn't actually do anything to improve robustness---they just made the function sufficiently difficult to optimize over that standard approaches couldn't typically find adversarial examples, but slight modifications of standard approaches could. We wrote a paper about this.
Then, in January of last year, with a lot of collaborators, I helped write a paper on how to perform adversarial robustness evaluations, based in part on something I had written up online the year before. This whitepaper went into extreme depth on what we believe is the best way to evaluate adversarial example defenses, ranging from high-level philosophy to details on hyperparameter settings.
Has the state of adversarial example defenses improved over the past two years? Honestly, I was expecting the answer to be “undeniably”.
So Florian, Wieland, Aleksander and I set out to see whether things had improved by studying thirteen defenses that looked interesting from the last two years. Is there some new and interesting way that defense evaluations are beginning to fail? Has the community learned how to perform rigorous evaluations?
Unfortunately, the answer was no. We could break all of the defenses we selected. And worse, defenses are failing in just the same way as before.
I'm going to start by talking about the progress we saw in evaluations---it's important not to forget that things are better.
Defenses now at least attempt to do proper evaluations. In the past, evaluations were all over the place and often didn't even try to break the proposed defense, instead just showing that existing attacks failed. In contrast, only three of the papers we studied here didn't perform such an adaptive attack and really try and break what they propose. This is a big improvement compared to the past. People are trying to do better.
While all of these papers tried adaptive attacks, they ultimately didn't succeed in finding strong ones, leading the authors to believe their defenses were actually robust when in fact the attacks were just weak. This isn't really what I'm concerned about, though. My concern is that how these defenses failed is basically the same as the way prior defenses failed. There isn't some new deep lesson to be learned from how these defenses broke. They just weren't evaluated properly the first time around. In particular, these papers often just introduced a method that makes it hard to perform gradient descent during attacks (either intentionally or not), exactly the same flaw that happened to the ICLR'18 defenses.
What has me most worried, though, is that these defenses explicitly say in their papers that they aren't doing this. This is a known failure mode, and so these defenses argue at length why their results aren't just caused by making the optimizers break. In all cases, they are.
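To make the failure mode concrete, here is a toy sketch (not taken from any of the papers we studied) of how gradient masking fools a naive evaluation. The "defense" is input quantization: a piecewise-constant function whose true gradient is zero almost everywhere, so differentiating through it gives the attack no signal and the defense looks robust. An adaptive, BPDA-style attack notices that the quantizer is approximately the identity and substitutes the identity for its gradient, recovering a useful attack direction. All function names here are made up for illustration.

```python
import numpy as np

def quantize(x, levels=10):
    """Toy "defense": input quantization. Piecewise constant, so its
    true gradient is zero almost everywhere (gradient masking)."""
    return np.round(x * levels) / levels

def model(x):
    """Toy differentiable classifier score; higher means "more adversarial"."""
    return 3.0 * x

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.42

# Naive evaluation: differentiate through defense + model. The quantizer's
# zero gradient stalls the optimizer, so the defense *appears* robust.
g_naive = numeric_grad(lambda v: model(quantize(v)), x)

# Adaptive (BPDA-style) attack: since quantize(x) is approximately x,
# replace its gradient with the identity and differentiate the model alone.
g_bpda = numeric_grad(model, quantize(x))

print(g_naive)  # 0.0 -- the naive attack gets no signal
print(g_bpda)   # 3.0 -- the adaptive attack recovers a useful direction
```

The point is not this particular trick but the pattern: a zero or useless gradient under standard attacks is evidence of an evaluation problem, not of robustness.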
Where do we go from here?
I believe in this research direction. I believe that developing robust machine learning is important, and that we should aim to understand and develop robust classifiers.
However, the rate at which broken defenses are accepted to top tier conferences is getting concerning. (There are a lot of papers that go on arXiv that try to defend against adversarial examples. Many have obvious flaws and never get accepted. That is natural and I'm not worried about that.)
I honestly don't know what to do about it. From what I can tell, there are three causes of this problem.
First, it truly is difficult to perform proper evaluations. Especially for people who are coming from the machine learning space, taking the security mindset is not something that's easy to do.
Second, there is a shortage of qualified reviewers. At each of the most recent ICLR, ICML, and NeurIPS there were between fifty and a hundred defenses submitted. The probability that at least one or two defenses get reviewed only by people who are not experts in evaluating adversarial example defenses is high. And because evaluations are difficult to perform to begin with, verifying them is even harder. This means that, by pure luck, at least a few defenses will get accepted just because no one was there to strike them down. Not only must we educate everyone in the field on what evaluations require, we also have to educate people not in the field that performing these evaluations is sufficiently difficult and different from standard evaluations that it requires additional experience.
Finally, authors are not incentivised to do thorough evaluations. Someone who has spent the last six months building a new defense has no reason to seriously try to break their defense. If they succeeded at breaking it, it wouldn't be publishable: people don't accept negative results. It's hard to convince someone to do something when doing it well makes the outcome worse for them.
Now I don't believe that anyone writing these papers goes in thinking “I'm going to do a bad evaluation, because I know if I do a good one it will get rejected”. But good evaluations take time. And given a bounded amount of time, it can be hard to find the time to spend developing strong attacks on a defense when no one requires them, and the best possible outcome is a negative result.
The biggest concern I have with the current process of getting defenses accepted to conferences is one of scope. From what I can tell, currently the only way to get a defense accepted is to claim that the defense works in all threat models, preferably with very high accuracy. (Important note: here by threat model I don't mean attack. The defender doesn't get to choose the attack that is used. But setting a realistic threat model, that might be rather constrained, is a standard part of computer security.)
There are a number of good papers that introduce defenses under much more restricted threat models, but never get accepted at conferences. If we're going to make progress, I think it's going to be necessary to start accepting (and therefore encouraging more of) these papers. Waiting for the one paper that solves it all isn't going to work.
The current reviewer attitude seems to be that if the results aren't amazing, and the paper doesn't claim to work under every threat model, then the paper should be rejected. The problem with that is that basically any paper which claims strong results under all threat models is just incorrect and broken.
I'd rather see people try to do something incremental, and do it well, than submit clearly incorrect “groundbreaking” results.
I mostly wrote this post for myself to force me to think carefully about the state of defense research in adversarial machine learning at the moment. If you noticed I never actually got to answering the question “are defenses improving” it's because I don't think I actually have the answer. There's a good case to be made that defense evaluations are improving: at least people try to attack them. But the fact remains that in almost all cases they're still not attacked correctly.
This clearly is not where we want to be. It would be better if defenses were obviously improving year-over-year and we knew that by just working a bit longer we might have something that works completely. But we're not yet there. And I don't have a solution to that. But I hope soon we will.