Publications
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
USENIX Security, 2019.
Nicholas Carlini, Chang Liu, Ulfar Erlingsson, Jernej Kos, Dawn Song
This paper describes a testing methodology for quantitatively assessing the risk that rare or unique training-data sequences are unintentionally memorized by generative sequence models—a common type of machine-learning model. Because such models are sometimes trained on sensitive data (e.g., the text of users' private messages), this methodology can benefit privacy by allowing deep-learning practitioners to select means of training that minimize such memorization.
In experiments, we show that unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences. Specifically, for models trained without consideration of memorization, we describe new, efficient procedures that can extract unique, secret sequences, such as credit card numbers. We show that our testing strategy is a practical and easy-to-use first line of defense, e.g., by describing its application to quantitatively limit data exposure in Google's Smart Compose, a commercial text-completion neural network trained on millions of users' email messages.
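The methodology can be sketched in a few lines: insert a randomly chosen canary sequence into the training data, train as usual, and then measure how strongly the model prefers that canary over every other candidate of the same format. A minimal Python sketch, assuming a log_perplexity hook into the trained model and a fully enumerable candidate space (both assumptions of this sketch, not the paper's implementation):

```python
import math

def exposure(canary, candidates, log_perplexity):
    """Rank the inserted canary among all candidate sequences of the same
    format by the model's log-perplexity; a canary that ranks near the top
    has been memorized.  `log_perplexity(seq)` is assumed to query the
    trained generative sequence model, and `candidates` must contain the
    canary itself."""
    ranked = sorted(candidates, key=log_perplexity)   # most likely first
    rank = ranked.index(canary) + 1
    return math.log2(len(candidates)) - math.log2(rank)
```

High exposure (the canary ranking near the top of a large candidate space) is the signal that the extraction procedures described above can exploit.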
Adversarial Examples Are a Natural Consequence of Test Error in Noise
ICML, 2019.
Nic Ford, Justin Gilmer, Nicholas Carlini, Dogus Cubuk
Over the last few years, the phenomenon of adversarial examples --- maliciously constructed inputs that fool trained machine learning models --- has captured the attention of the research community, especially when the adversary is restricted to small modifications of a correctly handled input. Less surprisingly, image classifiers also lack human-level performance on randomly corrupted images, such as images with additive Gaussian noise. In this paper we provide both empirical and theoretical evidence that these are two manifestations of the same underlying phenomenon, establishing close connections between the adversarial robustness and corruption robustness research programs. This suggests that improving adversarial robustness should go hand in hand with improving performance in the presence of more general and realistic image corruptions. Based on our results we recommend that future adversarial defenses consider evaluating the robustness of their methods to distributional shift with benchmarks such as Imagenet-C.
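A hedged sketch of the kind of corruption-robustness measurement this line of work recommends; the classifier interface, pixel range, and noise scale are assumptions of the sketch:

```python
import numpy as np

def accuracy_under_gaussian_noise(predict, images, labels, sigma=0.1, trials=10):
    """Estimate accuracy on inputs corrupted with additive Gaussian noise.
    `predict(batch)` is assumed to return integer class labels, and pixel
    values are assumed to lie in [0, 1]."""
    correct = 0.0
    for _ in range(trials):
        noisy = np.clip(images + sigma * np.random.randn(*images.shape), 0.0, 1.0)
        correct += np.mean(predict(noisy) == labels)
    return correct / trials
```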
Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition
ICML, 2019.
Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, Colin Raffel
Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial examples applied to speech recognition systems have neither of these properties: humans can easily identify the adversarial perturbations, and they are not effective when played over-the-air. This paper makes advances on both of these fronts. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining 100% targeted success rate on arbitrary full-sentence targets. Next, we make progress towards physical-world over-the-air audio adversarial examples by constructing perturbations which remain effective even after applying realistic simulated environmental distortions.
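A rough sketch of the masking-based imperceptibility term the abstract alludes to. The psychoacoustic threshold computation, the spectral representation, and the overall optimization loop are all omitted or assumed here; this is not the authors' implementation:

```python
import numpy as np

def masking_penalty(delta_magnitude_spectrum, masking_threshold):
    """Penalize only the perturbation energy that rises above the original
    audio's psychoacoustic masking threshold, so that the remaining
    perturbation is (close to) inaudible.  Both inputs are magnitude
    spectra of matching shape (frames x frequency bins); computing the
    threshold itself is a standard psychoacoustic-model step not shown."""
    excess = np.maximum(delta_magnitude_spectrum - masking_threshold, 0.0)
    return float(np.sum(excess ** 2))

# A full attack would minimize something like
#   recognition_loss(x + delta, target_transcript) + alpha * masking_penalty(...)
# over the perturbation delta, adjusting alpha to trade off audibility
# against the targeted transcription.
```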
Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness
SafeML ICLR Workshop, 2019.
Jörn-Henrik Jacobsen, Jens Behrmann, Nicholas Carlini, Florian Tramèr, Nicolas Papernot
Adversarial examples are malicious inputs crafted to cause a model to misclassify them. Their most common instantiation, "perturbation-based" adversarial examples introduce changes to the input that leave its true label unchanged, yet result in a different model prediction. Conversely, "invariance-based" adversarial examples insert changes to the input that leave the model's prediction unaffected despite the underlying input's label having changed.
In this paper, we demonstrate that robustness to perturbation-based adversarial examples is not only insufficient for general robustness, but worse, it can also increase vulnerability of the model to invariance-based adversarial examples. In addition to analytical constructions, we empirically study vision classifiers with state-of-the-art robustness to perturbation-based adversaries constrained by an lp norm. We mount attacks that exploit excessive model invariance in directions relevant to the task, which are able to find adversarial examples within the lp ball. In fact, we find that classifiers trained to be lp-norm robust are more vulnerable to invariance-based adversarial examples than their undefended counterparts.
Excessive invariance is not limited to models trained to be robust to perturbation-based lp-norm adversaries. In fact, we argue that the term adversarial example is used to capture a series of model limitations, some of which may not have been discovered yet. Accordingly, we call for a set of precise definitions that taxonomize and address each of these shortcomings in learning.
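Restating the two definitions above as a small sketch (the oracle function stands in for a human labeler and is an assumption of the sketch):

```python
def is_perturbation_adversarial(model, oracle, x, x_adv):
    # The change leaves the true label intact but flips the model's prediction.
    return oracle(x_adv) == oracle(x) and model(x_adv) != model(x)

def is_invariance_adversarial(model, oracle, x, x_adv):
    # The change flips the true label but the model's prediction stays the same.
    return oracle(x_adv) != oracle(x) and model(x_adv) == model(x)
```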
On Evaluating Adversarial Robustness
arXiv (unpublished), 2019.
Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry
Correctly evaluating defenses against adversarial examples has proven to be extremely difficult. Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded; most papers that propose defenses are quickly shown to be incorrect.
We believe a large contributing factor is the difficulty of performing security evaluations. In this paper, we discuss the methodological foundations, review commonly accepted best practices, and suggest new methods for evaluating defenses to adversarial examples. We hope that both researchers developing defenses as well as readers and reviewers who wish to understand the completeness of an evaluation consider our advice in order to avoid common pitfalls.
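As one concrete example of the kind of sanity check this advice covers, a minimal sketch assuming a model and attack that consume and return NumPy arrays: robust accuracy should fall monotonically as the perturbation budget grows, and an effectively unbounded attack should reduce it to near zero.

```python
import numpy as np

def sanity_check_attack(model, attack, x, y, epsilons=(0.01, 0.03, 0.1, 1.0)):
    """Two basic checks on an adversarial-robustness evaluation.
    `model(x)` returns predicted labels; `attack(model, x, y, eps)` returns
    adversarial inputs within an eps-ball; eps=1.0 on [0, 1] images is
    effectively unbounded."""
    accs = []
    for eps in epsilons:
        x_adv = attack(model, x, y, eps)
        accs.append(float(np.mean(model(x_adv) == y)))
    assert all(a >= b for a, b in zip(accs, accs[1:])), "robust accuracy is not monotone in eps"
    assert accs[-1] < 0.05, "an unbounded attack should succeed; gradients may be obfuscated"
    return accs
```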
Unrestricted Adversarial Examples
arXiv (unpublished), 2018.
Tom B. Brown, Nicholas Carlini, Chiyuan Zhang, Catherine Olsson, Paul Christiano, Ian Goodfellow
We introduce a two-player contest for evaluating the safety and robustness of machine learning systems, with a large prize pool. Unlike most prior work in ML robustness, which studies norm-constrained adversaries, we shift our focus to unconstrained adversaries. Defenders submit machine learning models, and try to achieve high accuracy and coverage on non-adversarial data while making no confident mistakes on adversarial inputs. Attackers try to subvert defenses by finding arbitrary unambiguous inputs where the model assigns an incorrect label with high confidence. We propose a simple unambiguous dataset ("bird-or-bicycle") to use as part of this contest. We hope this contest will help to more comprehensively evaluate the worst-case adversarial risk of machine learning models.
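A sketch of how the defender-side criterion might be scored; the abstain mechanism via a confidence threshold is an assumption of this sketch, not the contest's exact rules:

```python
def confident_error_rate(predict_with_confidence, inputs, true_labels, tau=0.5):
    """A defense may abstain on hard inputs, but must make no confident
    mistakes: count predictions made above the confidence threshold that
    disagree with the unambiguous true label."""
    confident_mistakes = 0
    for x, y in zip(inputs, true_labels):
        label, confidence = predict_with_confidence(x)
        if confidence >= tau and label != y:
            confident_mistakes += 1
    return confident_mistakes / len(inputs)
```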
Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
ICML, 2018. Best Paper.
Anish Athalye*, Nicholas Carlini*, and David Wagner
We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.
* Equal Contribution
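As one illustration of the attack techniques referenced in the abstract above: when a defense's input transformation shatters gradients, the paper attacks it with Backward Pass Differentiable Approximation (BPDA), approximating the transform on the backward pass. A minimal PyTorch-style sketch, with bit-depth reduction as a stand-in for the defense's preprocessing:

```python
import torch

def nondifferentiable_preprocess(x):
    # Stand-in for a defense's input transform; bit-depth reduction has
    # zero gradient almost everywhere, which shatters naive attacks.
    return torch.round(x * 255.0) / 255.0

class BPDAWrapper(torch.autograd.Function):
    """Apply the defense's preprocessing on the forward pass, but treat it
    as the identity on the backward pass so gradient-based attacks still
    receive a useful signal."""
    @staticmethod
    def forward(ctx, x):
        return nondifferentiable_preprocess(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # approximate the transform's Jacobian by the identity

# An attack then optimizes its usual loss through model(BPDAWrapper.apply(x))
# rather than model(nondifferentiable_preprocess(x)).
```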
Audio Adversarial Examples: Targeted Attacks on Speech-to-Text
Deep Learning and Security Workshop, 2018. Best Paper.
Nicholas Carlini and David Wagner
Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods
ACM Workshop on Artificial Intelligence and Security, 2017. Finalist, Best Paper.
Nicholas Carlini and David Wagner
Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong
USENIX Workshop on Offensive Technologies, 2017.
Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song
Towards Evaluating the Robustness of Neural Networks
IEEE Symposium on Security and Privacy, 2017. Best Student Paper.
Nicholas Carlini and David Wagner
Neural networks provide state-of-the-art results for most machine learning tasks. Unfortunately, neural networks are vulnerable to adversarial examples: given an input x and any target classification t, it is possible to find a new input x' that is similar to x but classified as t. This makes it difficult to apply neural networks in security-critical areas. Defensive distillation is a recently proposed approach that can take an arbitrary neural network and increase its robustness, reducing the success rate of current attacks at finding adversarial examples from 95% to 0.5%.
In this paper, we demonstrate that defensive distillation does not significantly increase the robustness of neural networks by introducing three new attack algorithms that are successful on both distilled and undistilled neural networks with 100% probability. Our attacks are tailored to three distance metrics used previously in the literature, and when compared to previous adversarial example generation algorithms, our attacks are often much more effective (and never worse). Furthermore, we propose using high-confidence adversarial examples in a simple transferability test we show can also be used to break defensive distillation. We hope our attacks will be used as a benchmark in future defense attempts to create neural networks that resist adversarial examples.
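For reference, a sketch of the targeted objective at the core of the paper's L2 attack: a hinge on the logit gap plus the size of the perturbation. The NumPy framing and variable names are mine; the box constraint and optimizer are omitted:

```python
import numpy as np

def cw_objective(logits, target, delta, c=1.0, kappa=0.0):
    """`logits` are the model's pre-softmax outputs on x + delta.  The hinge
    is satisfied once the target class exceeds every other class by at
    least kappa; c trades off distortion against attack success."""
    other = np.max(np.delete(logits, target))
    hinge = max(other - logits[target], -kappa)
    return float(np.sum(delta ** 2) + c * hinge)
```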
Hidden Voice Commands
USENIX Security, 2016. CSAW Best Applied Research Paper.
Nicholas Carlini*, Pratyush Mishra*, Tavish Vaidya*, Yuankai Zhang*, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou
Voice interfaces are becoming more ubiquitous and are now the primary input method for many devices. We explore in this paper how they can be attacked with hidden voice commands that are unintelligible to human listeners but which are interpreted as commands by devices.
We evaluate these attacks under two different threat models. In the black-box model, an attacker uses the speech recognition system as an opaque oracle. We show that the adversary can produce difficult-to-understand commands that are effective against existing systems in the black-box model. Under the white-box model, the attacker has full knowledge of the internals of the speech recognition system and uses it to create attack commands that we demonstrate through user testing are not understandable by humans.
We then evaluate several defenses, including notifying the user when a voice command is accepted; a verbal challenge-response protocol; and a machine learning approach that can detect our attacks with 99.8% accuracy.
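A heavily simplified sketch of the black-box setting described above, in which the recognizer is queried only as an oracle. The candidate-mangling step, the intelligibility proxy, and the search loop are all assumptions of this sketch rather than the paper's pipeline:

```python
import random

def blackbox_hidden_command(seed_audio, target_text, recognize, mangle,
                            intelligibility, budget=1000):
    """Repeatedly degrade candidate audio and keep versions the recognizer
    still transcribes as the target command, preferring the candidate a
    human is least likely to understand."""
    best, best_score = None, float("inf")
    candidates = [seed_audio]
    for _ in range(budget):
        cand = mangle(random.choice(candidates))  # e.g. filtering or re-synthesis
        if recognize(cand) == target_text:        # opaque-oracle query
            candidates.append(cand)
            score = intelligibility(cand)
            if score < best_score:
                best, best_score = cand, score
    return best
```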
Control-Flow Bending: On the Effectiveness of Control-Flow Integrity
USENIX Security, 2015.
Nicholas Carlini, Antonio Barresi, Mathias Payer, Thomas R. Gross and David Wagner
Control-Flow Integrity (CFI) is a defense which prevents control-flow hijacking attacks. While recent research has shown that coarse-grained CFI does not stop attacks, fine-grained CFI is believed to be secure.
We argue that assessing the effectiveness of practical CFI implementations is non-trivial and that common evaluation metrics fail to do so. We then evaluate fully-precise static CFI -- the most restrictive CFI policy that does not break functionality -- and reveal limitations in its security. Using a generalization of non-control-data attacks which we call Control-Flow Bending (CFB), we show how an attacker can leverage a memory corruption vulnerability to achieve Turing-complete computation on memory using just calls to the standard library. We use this attack technique to evaluate fully-precise static CFI on six real binaries and show that in five out of six cases, powerful attacks are still possible. Our results suggest that CFI may not be a reliable defense against memory corruption vulnerabilities.
We further evaluate shadow stacks in combination with CFI and find that their presence is necessary for security: deploying shadow stacks removes arbitrary code execution capabilities of attackers in three of six cases.
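A toy model of the two mechanisms discussed above, fully-precise static CFI and a shadow stack, to make the distinction concrete (this sketches the policies only, not real binaries):

```python
class ToyCFI:
    """Fully-precise static CFI restricts each indirect call site to the
    targets some legitimate execution could use; a shadow stack additionally
    pins every return to its matching call."""
    def __init__(self, allowed_targets):
        self.allowed = allowed_targets   # maps call site -> set of legal targets
        self.shadow_stack = []

    def indirect_call(self, site, target, return_addr):
        if target not in self.allowed[site]:
            raise RuntimeError("CFI violation")
        self.shadow_stack.append(return_addr)

    def ret(self, return_addr):
        if not self.shadow_stack or self.shadow_stack.pop() != return_addr:
            raise RuntimeError("shadow-stack violation")

# Control-Flow Bending attacks stay inside `allowed` at every indirect
# transfer yet still reach powerful standard-library calls, which is why
# the paper argues shadow stacks are necessary in addition to CFI.
```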
ROP is Still Dangerous: Breaking Modern Defenses
USENIX Security, 2014.
Nicholas Carlini and David Wagner
Improved Support for Machine-Assisted Ballot-Level Audits
USENIX Journal of Election Technology and Systems (JETS), Volume 1 Issue 1. Presented at EVT/WOTE 2013.
Eric Kim, Nicholas Carlini, Andrew Chang, George Yiu, Kai Wang, and David Wagner
Operator-Assisted Tabulation of Optical Scan Ballots
EVT/WOTE, 2012.
Kai Wang, Eric Kim, Nicholas Carlini, Ivan Motyashov, Daniel Nguyen, and David Wagner
An Evaluation of the Google Chrome Extension Security Architecture
USENIX Security, 2012.
Nicholas Carlini, Adrienne Porter Felt, and David Wagner
Short Papers
A critique of the DeepSec Platform for Security Analysis of Deep Learning Models
arXiv short paper, 2019.
Nicholas Carlini
Is AmI (Attacks Meet Interpretability) Robust to Adversarial Examples?
arXiv short paper, 2019.
Nicholas Carlini
On the Robustness of the CVPR 2018 White-Box Adversarial Example Defenses
Computer Vision: Challenges and Opportunities for Privacy and Security, 2018.
Anish Athalye and Nicholas Carlini
MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples
arXiv short paper, 2017.
Nicholas Carlini and David Wagner
Defensive Distillation is Not Robust to Adversarial Examples
arXiv short paper, 2016.
Nicholas Carlini and David Wagner