Nicholas Carlini

Research Scientist, Anthropic
nicholas [at] carlini [dot] com
GitHub | Google Scholar

I am a researcher working at the intersection of machine learning and computer security. Currently I work at Anthropic studying what bad things you could do with, or do to, language models. Previously, I was a research scientist at Google Brain (from 2018-2023) and DeepMind (from 2023-2025). I hold a Ph.D. from UC Berkeley under David Wagner, and a B.A. in computer science and mathematics (also from Berkeley).

My papers exposing flaws in machine learning models have received best paper awards at IEEE S&P (once), USENIX Security (twice), and ICML (three times), and have been covered in the popular press by the New York Times, the BBC, Nature Magazine, Science Magazine, Wired, and Popular Science.

When not otherwise busy with research, I write lots of useless code ranging from an obfuscated Tic-Tac-Toe Game written in a single call to printf (which won the IOCCC 2020 Best of Show), to a Doom clone in 13k of WebGL + JavaScript, to a fully functional CPU built on top of Conway's Game of Life, to a 2-ply chess engine implemented with a sequence of regular expressions.

A complete list of my publications is online, along with some of my code, and some extra writings.

Selected Recent Work

At the ICML alignment workshop last year I gave a short talk discussing what we can (and can't) learn from adversarial machine learning, for people who work on “alignment” of large language models. This talk tries to explain the difficulties that we have had in making machine learning models robust to adversaries, and explains how these difficulties will carry over to those who are trying to make large language models that generally are helpful and harmless.

Earlier last year I introduced a recent paper of ours developing the first practical poisoning attack on large-scale machine learning models. With our attack I could have poisoned the training dataset for anyone who has used LAION-400M (or other popular datasets) in the last six months. Our attack is trivial: I bought expired domains corresponding to URLs in popular image datasets. This gave us control over 0.01% of each of these datasets. In this talk (given at the Stanford MLSys seminar) I discuss how the attack works, the consequences of this attack, and potential defenses. More broadly, we hope machine learning researchers will study other simple but practical attacks on the machine learning pipeline.

In 2021, at USENIX Security, I presented a paper that was the result of a massive collaboration with ten co-authors to measure the privacy of large language models. It's been academically known for quite some time that if you train a machine learning model on a sensitive dataset, it's mathematically possible that releasing the model could violate the privacy of the users from the training data. But this has remained mostly something theory people say could happen, because math says so. In this paper we show that large language models actually do leak individual training examples from datasets they were trained on. To do this we show that given query access to GPT-2, it's possible to recover hundreds of training datapoints including PII, random numbers, and URLs from leaked email dumps.

At CRYPTO'20, I presented a paper I wrote with Matthew Jagielski and Ilya Mironov that introduces an improved model stealing attack. Given query access to a remote neural network, we are able to extract an almost identical copy of the parameters, layer-by-layer, one at a time. For models we extract, we can prove that the stolen copy is identical up to 30 bits of precision with respect to the original model. (If you're an ML person, you might want to skip the background, where I explain to the crypto audience what a fully connected neural network is.)

In 2019 I made a Doom clone in JavaScript. Until recently all content on this website was research, and while writing papers can be fun [a], sometimes you just need to blow off a little steam. The entire game fits in 13k: the 3D renderer, shadow mapper, game engine, levels, enemies, and music. The post talks about the process of designing the game and how to make it all happen under the constraints.

[a] Who are we kidding? Writing is never fun. But it's the cost of admission when doing research, which definitely is.

At CAMLIS 2019 I gave a talk covering what it means to evaluate adversarial robustness. This is a much higher-level talk for an audience that isn't deeply familiar with the area of adversarial machine learning research. (For a more technical version of this talk, see my recent USENIX Security invited talk that discusses these same topics in more depth.) The talk covers what adversarial examples are, how to generate them, how to (try to) defend against them, and finally what the future may hold.

At ICML 2018, I presented a paper I wrote with Anish Athalye and my advisor David Wagner: Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In this paper, we demonstrate that most of the ICLR'18 adversarial example defenses were ineffective at defending against attack and in fact just broke existing attack algorithms. We introduce stronger attacks that work in the presence of what we call “obfuscated gradients”. Because we won best paper, we were able to give two talks. The talk linked here is the plenary talk, where I argue that the evaluation methodology used widely in the community today is insufficient, and can be improved.