Reflecting on “Towards Evaluating the Robustness of Neural Networks”

by Nicholas Carlini 2022-08-17



I recently got back from attending USENIX Security 2022, where someone pointed out to me that it's been five years since I wrote “Towards Evaluating the Robustness of Neural Networks” (with my at-the-time advisor) and asked if I had any thoughts on the paper. I didn't give a particularly good answer in the moment, but I thought it was an interesting question, so I figured I'd write one down here instead. (In fact, I got the same question at IEEE S&P earlier this year too, and didn't have a great answer then either.)


Reflections on what went well

I'll start with a few of the things we got right in this paper.

Gradient descent is the right attack

The main thing I think we got right in this paper (and the thing I'm happiest we got right) was our core message: we should evaluate the robustness of neural networks by running attacks that use gradient descent as their foundation.

In retrospect this may sound obvious, but at the time it wasn't the way the field was going. Instead, the trend when we wrote our paper was to come up with rather bespoke attacks built on clever heuristics for how to fool neural networks. For example, an attack might define a complicated loss function and then greedily select pixels to flip in a way that would (approximately, maybe) reach a minimum (or maybe not).

And while these other clever attack setups do technically work at generating adversarial examples on undefended networks, it's exceptionally hard to turn them into attacks that actually allow you to evaluate the robustness of a defense. In the same way that “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” (Kernighan), attacking defended neural networks is easily twice as hard as attacking an undefended model, and so if you spend all your cleverness on designing the basic attack you have no hope of actually using it on a defended model.

And so the fact that our attack uses gradient descent, the same well-studied technique used extensively to train neural networks in the first place, makes it more likely to succeed at attacking defended models.
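
To make the core idea concrete, here is a minimal sketch (mine, not the paper's exact attack) of what “gradient descent as an attack” means: differentiate the classification loss with respect to the input rather than the weights, and step the input in the direction that increases the loss. It assumes a hypothetical PyTorch classifier `model` and a correctly labeled input `x` with label `y`.

    import torch
    import torch.nn.functional as F

    def gradient_attack(model, x, y, step_size=0.01, steps=100):
        """Plain gradient ascent on the loss, taken with respect to the input pixels."""
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            # Step the *input* uphill on the loss, then keep pixels in the valid [0, 1] range.
            x_adv = (x_adv + step_size * grad).clamp(0, 1).detach()
        return x_adv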

Constrained attacks are (still) interesting

While ours was certainly not the first paper to explicitly frame the adversarial example problem as optimization over a p-norm, I feel it had a fairly big impact in establishing this as the canonical way we construct attacks. A number of earlier papers never really defined what they meant by “close”, or tried to optimize some rather strange distance function. By defining an attack that worked over all the common distance metrics, our paper helped set these distances up as the standard ones.
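
Roughly, the formulation (paraphrased here in my own notation) is a constrained optimization problem: find the smallest perturbation, measured under some p-norm, that makes the classifier output a chosen target label while the result stays a valid image.

    % Sketch of the targeted p-norm formulation, notation mine: the classifier C
    % should assign the target label t, and the perturbed image must stay valid.
    \begin{align*}
      \text{minimize}   \quad & \|\delta\|_p \\
      \text{subject to} \quad & C(x + \delta) = t \\
                              & x + \delta \in [0, 1]^n
    \end{align*}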

Thorough evaluation of prior methods

Prior papers in this space often presented a new method without also evaluating it against earlier ones. In part this is because there weren't many prior papers, but it made it hard to see how much each new paper actually improved on what came before. I didn't want this to become common practice, so I tried hard to really characterize how our attack compared to each of the existing attacks across different threat models.

It's sufficiently well written

This is somewhat hard to quantify, but I think we did a good job explaining what was going on in this machine learning field to the (at-the-time) uninformed security researcher. I think the reason we got this right is that only a year earlier we ourselves had been security researchers uninformed about machine learning---and so it was easy to remember which things needed to be explained carefully.

In fact, it's not until about a third of the way through the paper that we even introduce our own new attacks; up until that point we're just describing what's been done previously and trying to characterize the existing methods.

We also spent a lot of time making sure to describe things that might otherwise be ambiguous (e.g., do we process logits, or probabilities; do we treat images in the range [0,1], or [0,255]). Most of these decisions are now standardized, but when they weren't it was very helpful to exactly specify what we did.


An interlude on peer review

Because I can't help myself, before I talk about what I think are actual problems in this paper, let me spend just a few minutes talking about the problems raised during peer review.

(So peer review is great and all. It's certainly better than nothing, and it usually works. But every researcher has their story where the peer reviewers made a serious blunder; this here is my story.)

The initial plan for our paper was to submit it to the AISec'16 workshop. We really didn't see the paper as doing anything impressive, because the attack was so simple. But then AISec pushed back their submission deadline, and we realized the Euro S&P deadline was just a few days later, and so we submitted there (as the more respected venue) instead.

To say our submission to Euro S&P did not go well would be an understatement. We received two reviews before the rebuttal period, both of which were rather absurd. One reviewer left a one-sentence review saying something to the effect of “this is not a computer security paper; please resubmit to a machine learning venue”, and the other said the paper simply wasn't interesting. I really wish I had saved these preliminary reviews, because they were just something else.

We submitted a rebuttal in response, and a month later we got back the final reviews: the paper was summarily rejected. A few reviewers were happy with it, but for the most part the consensus was that it wasn't interesting. In particular, we were told our paper apparently has "no real insights", that it "does not provide a systematic approach towards measuring the robustness of neural networks" (and as a result has "limited novelty" and "makes very limited contributions"), and that on the whole "the paper lacks [...] clear motivation" for studying the security of machine learning.

In retrospect, I think we can agree these reviews may not have been the best take.


Reflections on what didn't go well

Now, all this is not to say that our paper had no problems. In fact, on the whole, I would say there are probably more things we could have done better than things we actually did well. (There's probably no paper in existence about which that couldn't be said... it's the nature of research when you're writing at the edge of knowledge.)

Our attack was too complicated

I put a lot of work into squeezing every last bit of performance out of the attack. We used the Adam optimizer for gradient descent, set the default parameters to run 10,000 iterations of gradient descent, performed twenty iterations of binary search on top of this to tune the attack hyperparameters, and used a complicated tanh-based box constraint. We evaluated each of these components in turn, and found that each gave something like a 5% improvement. And so while the result was a really quite strong attack, it was both very slow (because it ran so many iterations of gradient descent) and, more importantly, hard to understand (because it had so many moving pieces).
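
For a sense of just how many moving pieces that is, here is a schematic sketch of the attack's overall structure (my simplified reconstruction, deliberately not line-for-line faithful to the paper): a tanh change of variables so the image always stays inside the [0,1] box, Adam on the unconstrained variable, and an outer binary search over a constant c that trades off distortion against misclassification. It assumes a hypothetical PyTorch `model` returning logits for a single-example batch, and a target class `t`.

    import torch
    import torch.nn.functional as F

    def cw_l2_sketch(model, x, t, steps=1000, search_steps=10):
        """Schematic C&W-style L2 attack: an outer binary search over c around an inner Adam loop."""
        lo, hi, c = 0.0, 100.0, 1.0
        num_classes = model(x).shape[1]
        best = x.clone()
        for _ in range(search_steps):  # outer binary search over the trade-off constant c
            # tanh change of variables: (tanh(w) + 1) / 2 is always a valid image in [0, 1]
            w = torch.atanh((x * 2 - 1) * 0.999999).detach().requires_grad_(True)
            opt = torch.optim.Adam([w], lr=0.01)
            for _ in range(steps):  # inner gradient descent with Adam
                x_adv = (torch.tanh(w) + 1) / 2
                logits = model(x_adv)[0]
                # margin loss: push the target logit above the largest non-target logit
                other_max = (logits - 1e9 * F.one_hot(torch.tensor(t), num_classes)).max()
                loss = ((x_adv - x) ** 2).sum() + c * (other_max - logits[t]).clamp(min=0)
                opt.zero_grad()
                loss.backward()
                opt.step()
            x_adv = ((torch.tanh(w) + 1) / 2).detach()
            if model(x_adv).argmax(1).item() == t:  # success: a smaller c may give less distortion
                best, hi = x_adv, c                 # (keeps the last success; the real attack tracks the best one)
            else:                                   # failure: the misclassification term needs more weight
                lo = c
            c = (lo + hi) / 2
        return best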

On the other hand, if you just use straight gradient descent without any of these bells and whistles, you still get an entirely passable attack that works just fine. This is what was done in the follow-up paper “Towards Deep Learning Models Resistant to Adversarial Attacks” that introduced the “Projected Gradient Descent” (PGD) attack. It's simpler, and as a result, better. We could (and should) have done that.
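
For contrast with the sketch above, here is roughly what a PGD-style attack under an L-infinity budget looks like (again my own sketch, not the exact attack from that paper): repeated signed gradient steps, each followed by projection back onto the eps-ball around the original input and clipping to the valid pixel range.

    import torch
    import torch.nn.functional as F

    def pgd_linf(model, x, y, eps=8/255, step=2/255, steps=40):
        """Iterated signed gradient ascent on the loss, projected onto an L-infinity ball."""
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv + step * grad.sign()                     # ascend the loss
                x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project onto the eps-ball
                x_adv = x_adv.clamp(0, 1)                              # stay a valid image
        return x_adv.detach()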

We missed a bunch of prior work

Our paper kind of makes it seem like the field of adversarial machine learning began in 2014. It did not. There are even papers well before that literally titled “Adversarial Machine Learning”. Our paper didn't talk about any of these early results because I honestly hadn't read any of them, for better or for worse.

While there is a perennial debate around how and when to cite prior work, I think it's pretty clear in this case that I messed up and should have been more diligent in finding the earlier papers on this topic. Who knows, I may have even learned a thing or two. Now, in this particular case I don't think it would have changed much---the problems we were solving were sufficiently different that they needed new approaches---but it would have been nicer to fit our work into the larger area. Especially given that people sometimes read our paper now as an introduction to the field, it would have been nice to give a more holistic view of the research space.

Give things a name!

So here's a rather tiny thing we got wrong: we didn't give our attack a name. This was a mistake. In fact, if you go download the source of the paper from arXiv you'll find the following comment in the LaTeX source:


   \section{Our Three Attacks} %better title


So yeah. I guess I knew it needed a name at the time, too. I thought I had a good reason not to give it one, though: if all we're doing is applying gradient descent as an attack, is it really right to call it something else? At Berkeley we were just calling it the “optimization attack” to differentiate it from everything else that was doing something more ad hoc.

But in retrospect it would definitely have been better to give it a name. PGD was right to win out as the attack method of choice on technical grounds alone---but having a name you could use to refer to it definitely didn't hurt. (Amusingly, I did title one of our attack variants “projected gradient descent”, but couldn't bring myself to name the attack after the projection method.)

Norm-constrained attacks aren't all that matter

Our paper only considered norm-constrained attacks: that is, attacks that limit the adversary to making some small pixel-level perturbation of the input. We didn't mention other kinds of attacks at all. In retrospect we should have made it clear that other metrics matter too. Just because we focused on p-norm attacks doesn't mean other distances don't matter, and I think we could have done more to encourage future work on finding better distance functions. (Even five years on, we still haven't done this very well.)

For example, in many security-critical situations you don't actually need to take some input and perturb it minimally to fool a classifier. It's perfectly valid to make (very) large changes to the input if this makes the classifier make a mistake. It's even perfectly valid to just come up with an entirely new input that fools the classifier. We work on constrained attacks because they're harder, and if we can succeed at the harder task we can succeed at the easier one, but we didn't say this explicitly.

Don't make math mistakes

Our paper has literally one section with a list of equations in it, and I got two of them wrong. Fortunately they are wrong in fairly trivial ways (a constant rescaling fixes things), and we don't actually end up using those equations (other things worked better---maybe because we got these wrong!), but it's still not the best thing to have in a paper.

(I've been asked a number of times, “if you know these errors are in the paper, why haven't you fixed them?”. Maybe I should. But the reason I don't is that (maybe archaically) I see a paper as an artifact of what happened at the time it was written, and unless something dramatic like a retraction needs to happen, I would rather present the paper as it existed then than as what I wish it had been. Once I start making one kind of edit to a prior paper of mine, I don't know where I'd draw the line.)


Philosophizing on early “success”

It's somewhat strange knowing that my most cited work is behind me, and that nothing I ever do will match this paper no matter how hard I try. You may think I'm being overly dramatic here, but let me assure you I am not: after just five years this paper has somehow become the 6th most cited paper ever to appear at a computer security conference. This kind of success is not something one can just decide to replicate, and statistically I will never do it again.

You can only get so lucky in picking a research topic, and when I chose to write a paper on adversarial machine learning in 2016 I was about as lucky as anyone could ever be, because it was exactly the right time. Little enough was known about the problem that there were lots of easy ideas to write down, but not so little that the papers were still all confused. It was also right as the deep learning craze took off, and so even though we were doing very similar things to what prior papers had done on SVMs or decision trees, the fact that this was The New Hot Topic made it exciting. I had also accidentally prepared myself perfectly for writing this paper (by spending the first half of my PhD getting really good at writing attack papers). And while I'll freely admit that there was some amount of skill involved in this paper becoming what it has, I'd give the majority of the credit to luck. If I hadn't written this paper, someone else would have done it just as well.

Having something like this happen definitely changed the way I do research. I enjoy just being able to work on whatever random problem I want, regardless of its importance (any of these references should convince you of this). But after having “succeeded” once, people begin to expect it in the future. And so, at least when it comes to research projects, I've sort of succumbed to this expectation and stopped writing papers that I know aren't “important”. Which isn't to say I always work only on important things, but there's always that nagging at the back of my mind that I should be.

Now with that said, it's at the same time really rather relaxing knowing that even if everything else I do fails to be relevant, I'll still have this paper. Having this paper behind me in a way lets me take more risks with the kinds of papers I write.

The other thing this paper has made me deeply understand is the degree to which citations are a truly terrible metric of impact. This paper is more highly cited than the attack papers that introduced differential and linear cryptanalysis, return-oriented programming, or (any of the) number field sieves---despite each of those papers being about 100x more important than mine and literally changing the way we think about cryptography and computer security. In contrast, people currently like to cite my paper, but the fundamental way we build and use computers isn't any different because of it.

Anyway, with all this out of the way, it's really quite amazing to see how far the field of adversarial machine learning has come in the last five years. We've gone from knowing literally nothing about what it means for models to be robust, to at least now understanding what robustness means even if it's been really hard to actually solve the problem completely. And I'm glad to have been a part of this.



