by Nicholas Carlini 2024-01-21
How do I pick what research problems I want to solve? I get asked this question often, most recently in December at NeurIPS, and so on my flight back I decided to describe the only piece of my incredibly rudimentary system that's at all a process. I maintain a single file called ideas.txt, where I just append a new entry every time I think of something that might be an interesting topic for a paper. When it's time to pick my next project, I skim through the list and pick whichever I think is most interesting. Or exciting. Or important. Or whatever I'm looking for at the moment. (Or I find something new entirely if nothing looks compelling.)
But first: this is the system that works best for me, which means it may not be the best system for you. In particular, there are two key factors that make this system uniquely suited to my use case:
But with that said, here's my process.
Each time I come up with an idea that I think is sufficiently interesting that I could imagine it being a research paper, I add it to my ideas.txt file.
Importantly, only ideas that I think (at the time) are good go in this file. Stephen King has a quote about writing that I think applies here: “a writer's notebook is the best way in the world to immortalize bad ideas”.
I think this applies to research too. And so I try to counter this effect by only writing down ideas that I really believe are good, or might become good, at the time.
But there is one key aspect where research is different than writing books: bad research ideas can quickly become good research ideas as progress is made. Something that would have been an insane idea five years ago might be a great idea today, because new techniques have been developed that make the idea possible.
Writing down ideas really helps me, personally, because I can only work on a small number of things at a time. And when I'm in the middle of working on one project, if I think of something that's really exciting, I can just write it down and get back to the task at hand. This helps me avoid getting distracted by new ideas, because I know I won't forget them and will always be able to come back to them later.
I view this file as an append-only, write-only log of ideas that I only read through a few times a year. This means I don't structure it in any interesting way; I just add an entry each time I think of something that might be worth doing.
Every once in a while, when I've finished whatever my current projects are, I skim through the last few entries and see if anything I wrote down at the time is still something I think is interesting. This filtering of ideas is useful because it means I've thought the idea was good at least twice: once when I wrote it down, and once when I read it. And the separation in time lets me evaluate it more objectively. Oftentimes, the idea will no longer be something that's important or interesting. Other times someone will have already solved it. But on more than a few occasions, I've found good ideas that I've turned into fairly nice papers.
Initially I thought about trying to write at a high level about how I go about doing this, but I couldn't come up with any generic advice that was worth saying. (That's why I'm writing this in January, and not back in December when I was most recently asked how I come up with ideas.)
So below I've decided to just dump my entire ideas.txt file, from when I started it in March of 2016 up through the end of December 2019. (Stopping at 2019 gives me enough time to reflect back on these ideas with some hindsight. Who knows, if people find this interesting, maybe in a few more years I'll write a followup.)
The rest of this post won't make any sense if you're not someone who is at least somewhat familiar with the field of adversarial machine learning. These were notes to myself, not intended to be something I use to explain ideas to others. Sorry about that, but I don't have the time to explain the background for each of these ideas. Where possible I've tried to include links.
If you're not someone who's deeply familiar with this field, let me summarize for you a few of the main takeaways:
Original comments to myself | Notes on the idea as of January 2024
Adversarial training: | Well, I started off with a great idea! Adversarial training is, to date, the best defense we have against adversarial examples. Unfortunately, I didn't actually think of the "right" idea here: I was thinking of using a neural network to generate adversarial examples, but the best way to do this is to use a gradient-based attack.
Adversarial co-training: | This idea basically doesn't work. I tried hard but nothing worked. Such is the way of research.
Average distance required to fool for different classes of models, for different distance metrics: | This question of how to evaluate attacks beyond just L_p distances is still mostly open.
How does regularization change the behavior of models adversarially? | These were some good ideas at the time. The answer turns out to be "no, regularization does not help with adversarial robustness", but I don't think it was at all obvious a priori that this would be the case. You can also see how, early on, I tried to write down how to solve a problem; I later realized this was probably not smart, and instead started shifting to writing down which problems were worth solving.
Problem: We train on only actual digits, and not on non-digits. | Also something that doesn't help much.
Train a network where the attacker can only run it forward and not backward cryptographically? That is, can we trap-door a network so that it can be evaluated forwards by anyone but backwards only by someone who knows the secret? This would prevent gradient descent in the style of https://arxiv.org/pdf/1610.01934v2.pdf | This was just a bad idea. It won't work.
Lots of systems to detect adversarial examples, can we break them? | This ended up becoming my 5th most cited paper (as of the time of writing of this post). |
Try to evaluate how malicious training data could mess up your scheme. | Well, this was also a really good idea! If I had actually done this, I would have been able to claim the first poisoning attacks on deep learning models.
Can we train a network with a "backdoor" where we take some given values we want it to learn to misclassify and make it memorize those, without impacting the accuracy of other labels? | Also would have been a great idea! BadNets did this later in the year. I remember considering working on this paper in the summer of 2017 and deciding that it probably wasn't actually something that would be interesting ... I guess I should have listened to myself the first time around. |
L0 Attack appears to produce better-than-required attacks with negative score? | I honestly have no idea what this meant. |
Classify spam/ham. | I guess we would call this a "clean label" poisoning attack today.
Monotone classifiers: | This ended up becoming the final chapter of my thesis.
Is it true that more accurate classifiers are harder to fool? Looks like the answer is definitely "no" | I still don't think we know the answer to this well. Lots of papers have been written on the topic though; it's hard to really answer. |
Run L0 attack, find optimal pixels to change. Use those. Now say "you can't touch those pixels". Repeat. | I never did this. I still think it would be an interesting question to answer.
Attacks have two different philosophies: Should write a brief note comparing these two. Usually it's easy to convert from one to the other (e.g., can do 2 => 1 with log(n) rounds). Converting 1 => 2 can be tricky, maybe? | I think people understand this now. It didn't need a paper.
Current black-box attacks are awful. | This was before ZOO (or any of the other query-based black-box attacks) existed. The question here was, I think, correct: the black-box attacks from 2017 were in fact terrible. But my idea here for how to fix it was wrong. It's kind of surprising how "hard" of an idea it was to just run a white-box attack by estimating gradients from queries. ZOO doesn't get much credit for this anymore today, but it really did change the game of black-box attacks.
Treat as a crypto problem. We have a function F_k(x)=y and want to learn what F is by asking questions with different x. | This question is the foundation for two papers of mine that study model stealing from a cryptographic perspective. Actually, if you read this paper, we motivate it exactly this way!
Current untargeted attacks are just targeted on the nearest other class. | Not really important.
RL adversarial examples using the real RL environment, not making perturbations at the image level. | The first paper that finally did a good job of this was only published recently.
Improving the L2 attack: | Still something no one has really done, but not that important.
Proofs of non-existence for adversarial examples: | I tried to do this. It's computationally hard and nothing works.
Detector defense: | I spent a good amount of time on this. I got it working better than the best defenses of the time but never published it because I didn't really think it was good enough to be worth doing. In retrospect I was probably wrong here.
Method to double the security radius: | I have recently seen a paper that does this, but I don't remember its name.
Break captcha: | There are still a lot of papers on this topic, and I still don't think many of them are good. If you're reading this: someone who understands ML really should solve the captcha problem once and for all.
Break document summarization and make it pick a completely wrong summary of something. (e.g., make it take a document about how good something is and make it extract how bad it is instead.) | This would also still be an interesting research problem!
Minimum bounding box adversarial example. Do it in a greedy manner the same way that we do L0 attacks. Initially find an attack using all pixels, then see which border we use less and then re-solve without those border pixels. | I remember writing this down because I still didn't like Lp perturbation attacks, and wanted to come up with something better. And at the time the only better idea I had was bounding box, which ended up becoming a nice paper. |
For limited-dimension subspace attacks (e.g., just cropping/rescaling) try to do "provable" minimal distortion by alternating between finding the point with highest | Uh, I seem to have forgotten to complete the sentence.
Train a GAN by removing the G function and making it just be the adversary. That is, the goal of D is to discriminate between real and fake, and instead of using G as a generator, just use optimization attacks. For what purpose? Maybe this will get a stronger discriminator. | Crazy idea, probably would never work.
Train a network that makes it easy to find the best adversarial example. | I don't think anyone has done this -- I wonder if it would still work. I'd guess the answer is "no" (and that's why I never tried it; I got tired of breaking all my ideas and having nothing to show for it).
1. Train a robust neural network that can predict class and "is adversarial". 2. Use this to help GAN training by using this as part of the D network. | Another crazy idea I kept thinking about and wrote down mainly so that I could feel like I could move on without it bothering me.
Can we increase robustness on CIFAR-10 by adding CIFAR-100 images to add more data? Need a way to label the CIFAR-100 images: maybe use the unified embedding paper? | Another problem that someone else ended up doing better than what I was thinking when I wrote this down.
Can we detect if we're being attacked? That is, we don't want to know if this one image is adversarial, but we want to know if some sequence of queries we've received is indicative of an attack. | Another idea that ended up becoming a paper I wrote! Specifically, it became this paper. |
Numerical estimate of the difficulty of finding adversarial examples. Maybe measure the effectiveness of finding an adversarial example after K iterations of gradient descent? Look at the slope of the curve? We want something small that's easy to find adversarial examples for, because it means we're probably not masking gradients. Another way: Find the nearest adversarial example at distance d. Use K steps with step size a so that K*a = d. Compare d = |x-x'| and the length of the path \sum_{i=0}^K |x'_i - x'_{i+1}|. | I have no idea what I wanted to do here.
Computing adversarial examples by taking the mean over random perturbations. | This is now something people do when generating black-box attacks. |
Estimate dimensionality of adversarial subspace with random perturbations | I was specifically thinking that this paper was not the best and wanted to improve on it. But I didn't have ideas for how.
Blackbox attack: try to learn the values that are held secret (e.g., classifying x+δ, classifying T(x), etc.) | The second time I wrote this idea down; I guess I didn't notice the first time. Well, it was a good idea that ended up with a nice paper.
Talk idea: how not to be seen as if we're playing with toys. We are solving hard problems, but we need to motivate it. Safety; stop the L_p nonsense; actually do science and perform evaluations properly. | I still haven't done this.
Understand loss in robust training by freezing weights of a pre-trained network and trying to see what's different. | This would be very interesting! I haven't done it though; it's probably not the most important thing.
Quantifying linearity of a neural network. For some normal point x and some other point x' close to x, compute mean[ sim(∇f(x_i), ∇f(x_{i+1})) ] for points x_i, x_{i+1} interpolating from x to x'. | I have no idea why I'd want to do this. (A rough sketch of what this measurement might look like is included after the table.)
Attack the Wikipedia comment spam detector, or the Google Chrome extension for comment spam, "Tune". | Would still be fun to do this. I haven't had the time, but I'd like to know the answer.
Training poisoning DoS. Insert some training points that make models diverge. This should be easy if the input domain is unconstrained. | Still no one knows how to do this. It's a surprisingly hard problem... |
Re-visit same-class poisoning. This is now called a clean-label back-door, I believe. | I still haven't done this.
What is the volume of the region that can be reached with a path of high confidence followed by a jump of size epsilon? Start with an image x classified with F(x)_y > t. Find a path p so that p(0) = x, p(1) = x', and ∀ α ∈ [0,1], F(p(α))_y > t. What fraction of images x' can be reached in such a way? "Mode Connectivity": Freeman and Bruna 16, Garipov 18, Draxler 18 | I thought this was interesting at the time, but don't anymore. I don't know why I would care about this, really.
A stricter adversarial example problem: find a pair of images x, x' so that |x-x'| is small but f(x)-f(x') is big. Necessary to be "correct" in general that this isn't easy. | Adversarial examples are hard enough as is. Let's not make the problem any easier for the attacker.... |
Try to see why gradient descent is hard for adversarial examples. Find a provably minimally distorted input with some technique. Then ask: why can't we find this one with gradient descent? Interpolate between x and x' and look at loss function. Is there no decreasing path from x to x'? Is there a decreasing path but other paths are "better"? | This became a paper one of my interns wrote. |
Looking at transferability. Don't just judge based on "success / fail" but, for each example, quantitatively measure "how successful" it was. Candidates: probability gap (p_t - max_{t' ≠ t} p_{t'}), logit gap [same as above but with logits]. Or maybe more interesting: how much "work" is required to fix it. x: original, A: attack algorithm, x' = A(x, F): adversarial on F. Now compute x_1 = A(x, G) and x_2 = A(x', G), and compare the ratio |x'-x_2| / |x-x_1|. | I think other people have written a bit about this, but not in the way I'd like. (A small sketch of this "work ratio" measurement is included after the table.)
Train a NN on reviewers to de-identify reviewers. | This is one of those things that I'd like to know is possible, but seems annoying to do, and also has some ethical considerations. Stylometry is a thing, and I feel like we could do even better, but I don't really want to be someone who does this.
Backdoor in a way that can't be detected. Might be hard to do "can't detect if it was backdoored or not", but should be easy to do "can't detect what the backdoor key is". | Getting scooped by a Turing award winner doesn't feel so bad.
Re-run the features vs bugs experiment. For image X_i generate perturbation P_i. Now encode X_i + P_j for i != j. Does it work? | I don't know the answer to this!
Attack the L0 detectors: https://arxiv.org/pdf/1812.09638.pdf, the NeurIPS paper from last year | I did end up breaking this in this paper.
Attack the speech defense papers: Bo's paper [I think I saw an attack on this one recently], "Adversarial Regularization for Attention Based End-to-End Robust Speech Recognition" | Another paper we broke in this paper. |
Does having access to intermediate models leak more data? Assume the adversary has checkpoints from every N batches. Compare this to just having M total models. | Ah! My first privacy question that I wrote down here. For a long time I spent most of my time thinking about adversarial examples and so only wrote those questions here, but I guess you can start to see the broadening out of my interests here. This is something people have been thinking about quite a lot in the last year or so. It's been interesting seeing how much progress there is on this topic.
Can we get a much better estimate of the adversarial dimensionality? Find the worst-case direction, then the next-worst direction that is orthogonal to the first. Then repeat. Plot [dimension number] versus [distortion required]. This has problems. Maybe optimize triples (a,b,c) so that all are mutually orthogonal and we maximize mean distortion of a,b,c. Then pick the one that's most-adversarial and then repeat for b,c,d with them all being orthogonal to a. What happens for adversarially robust classifiers? What if we just generate K directions all at the same time, orthogonal, and measure the average loss compared to each other? | Back to adversarial examples. I don't know why this is important though.
Poison semi-supervised learning. What if can only introduce an example, but NOT A LABEL? How to do this? First: understand training dynamics. Then: try to grow the region of points in the desired direction. | I came up with this idea while working on MixMatch, and then wrote a paper exactly on this topic at USENIX Security. |
Attack deepfake classifier Adobe Photoshop AI Detector | I never actually managed to get around to doing this, but it did turn into a different paper. Still on deepfake detection, but attacking different defenses instead of this one.
Attack the c0 discontinuous adversarial example defense at ICLR | Yet another paper we broke in this paper.
Membership inference attacks on GPT-2 to determine if articles were present. | I wrote this idea down after a discussion with Dawn Song at NeurIPS 2019. We never ended up writing a paper exactly on this, but ... let's go to the next row. |
Secret-sharer attack on transformer models. | This second idea was much more interesting: trying to actually extract data from the GPT-2 model. And this paper actually ended up becoming Extracting Training Data from Large Language Models, a paper that's since become very important in the field. |
Experiment: take MNIST and add an all-black/white image to the training set. Train a model. What is the confidence of the new image? What was the confidence before? Problem => maybe it's always high. Solution => find an image that's consistently (over many models) low-confidence. Insert that image, and see what happens? | I had the opportunity to host Milad Nasr as an intern in 2020 and we wrote a paper on this topic.
New adversarial example failure mode analysis: train with fewer samples. | I haven't tried this. |
Given a loss function for a defense, how do we tell if it's a good loss function? We have tried dumb things: eventually we succeed with enough distortion. Are there better methods? If for input x we had advex x', we could lerp(x, x') and look at the value over time. => not guaranteed to be correct in high dimensions; there might be a low-dimensional path. Given x and adversarial example x', compute the minimum number of linear segments [x, x1, x2, x3, ..., x'] such that the loss function is monotonically decreasing. => there exists *A* path that is easy to find. Maximize the loss function instead of minimizing it. Check if *that* is possible. | This is the same research question I posed in one of the ideas earlier. I guess writing things down twice happens when you like the idea. This became a paper one of my interns wrote. (A sketch of the basic interpolation check is included after the table.)
New product to attack: Apple's photo clustering? Photos search? | Not sure why I'd want to attack this, but you can see a bit of a shift here in my research interests toward making attacks practical.
Attack these papers: Training DNNs with improved adversarial robustness. NeurIPS'18. Robust detection of adversarial attacks by modeling the intrinsic properties of deep neural networks. NeurIPS'18. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. NeurIPS'18. HyperGAN? Adversarial defense by restricting the hidden space of deep neural networks. ICCV. Interpret neural networks by identifying critical data routing paths. Adversarial defense through network profiling based path extraction. | These mostly turned into this paper.
Perceptual adversarial examples with LGN model. See: YouTube Video | I haven't thought about this at all. Maybe it would be interesting? |
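A few of the notes above are terse enough that a small sketch may help make them concrete. First, the "quantifying linearity" entry: a minimal sketch, assuming only a generic grad_fn that returns the gradient of some loss at a point. The function names and the toy gradient below are purely illustrative, not anything from the original file.

```python
# Minimal sketch of the "quantifying linearity" note above: interpolate between
# a point x and a nearby x', and compute the mean cosine similarity between the
# gradients at consecutive points. `grad_fn` and the toy gradient in __main__
# are illustrative stand-ins, not code from the original ideas.txt.
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def linearity_score(grad_fn, x, x_prime, num_steps=16):
    """mean[ sim(grad f(x_i), grad f(x_{i+1})) ] along the segment from x to x'.
    A score near 1 means f behaves close to linearly between the two points."""
    alphas = np.linspace(0.0, 1.0, num_steps + 1)
    grads = [grad_fn((1 - a) * x + a * x_prime) for a in alphas]
    sims = [cosine_similarity(grads[i], grads[i + 1]) for i in range(num_steps)]
    return float(np.mean(sims))

if __name__ == "__main__":
    # Toy example: f(x) = sum(x**4), whose gradient 4*x**3 is clearly nonlinear.
    grad_fn = lambda x: 4 * x ** 3
    rng = np.random.default_rng(0)
    x = rng.normal(size=32)
    print(linearity_score(grad_fn, x, x + 0.5 * rng.normal(size=32)))
```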
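Next, the "interpolate between x and an adversarial x' and look at the loss" check that shows up twice above. This is just my own minimal reading of the note, assuming a scalar loss_fn evaluated at a single input; it is not the method from the interns' paper.

```python
# Minimal sketch of the interpolation check described above: evaluate the loss
# along the straight line from x to an adversarial example x_adv and ask
# whether it decreases monotonically. `loss_fn` is an illustrative stand-in.
import numpy as np

def loss_along_path(loss_fn, x, x_adv, num_steps=64):
    """Loss values at points linearly interpolating from x to x_adv."""
    alphas = np.linspace(0.0, 1.0, num_steps + 1)
    return np.array([loss_fn((1 - a) * x + a * x_adv) for a in alphas])

def is_monotone_decreasing(losses, tol=1e-8):
    """True if the loss never increases along the path, i.e. there is at least
    one easy straight-line route from x to x_adv for gradient descent."""
    return bool(np.all(np.diff(losses) <= tol))

if __name__ == "__main__":
    # Toy 1-D example with a bumpy loss: the straight path is not monotone.
    loss_fn = lambda z: float(np.sin(3 * z[0]) + 0.1 * z[0] ** 2)
    x, x_adv = np.array([0.0]), np.array([4.0])
    losses = loss_along_path(loss_fn, x, x_adv)
    print(is_monotone_decreasing(losses), losses[:5])
```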
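And finally, the transferability "work ratio" from the note above: a sketch under the assumption that attack(model, x) is some attack returning an adversarial example for that model. Both `attack` and `model_g`, as well as the fake attack in the demo, are hypothetical placeholders for illustration only.

```python
# Minimal sketch of the transferability "work ratio" |x'-x_2| / |x-x_1| from the
# note above, where x' = A(x, F), x_1 = A(x, G), and x_2 = A(x', G).
# `attack` and `model_g` are hypothetical placeholders, not a real attack/model.
import numpy as np

def transfer_work_ratio(attack, model_g, x, x_adv_f):
    """x_adv_f = A(x, F) is adversarial for the source model F. A ratio well
    below 1 means x_adv_f already did most of the work needed to fool G."""
    x1 = attack(model_g, x)        # attack the target model from the clean input
    x2 = attack(model_g, x_adv_f)  # attack the target model from the transferred input
    return np.linalg.norm(x_adv_f - x2) / (np.linalg.norm(x - x1) + 1e-12)

if __name__ == "__main__":
    # Toy demo with a fake "attack" that just nudges its input a fixed amount.
    rng = np.random.default_rng(0)
    fake_attack = lambda model, z: z + 0.1 * np.sign(rng.normal(size=z.shape))
    x = rng.normal(size=64)
    x_adv_f = x + 0.3 * np.sign(rng.normal(size=64))  # pretend this is A(x, F)
    print(transfer_work_ratio(fake_attack, None, x, x_adv_f))
```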
I've always found it fascinating to learn how other people approach their ideas. If you found this interesting, Leslie Lamport writes a short retrospective paragraph for each of his papers, which I found fun to read; Ian Goodfellow published the git repository he used throughout his PhD; and, maybe most famously, John Carmack published his .plan files for what work he was doing at any moment in time.
If you have an interesting process of your own, I'd encourage you to share it. And I hope you found mine interesting.